- Okay, sounds like it is. I'll be telling you about adversarial examples and adversarial training today. Thank you.

As an overview, I will start off by telling you what adversarial examples are, and then I'll explain why they happen, why it's possible for them to exist. I'll talk a little bit about how adversarial examples pose real-world security threats, that they can actually be used to compromise systems built on machine learning. I'll tell you what the defenses are so far, but mostly defenses are an open research problem that I hope some of you will move on to tackle. And then finally I'll tell you how to use adversarial examples to improve other machine learning algorithms, even if you want to build a machine learning algorithm that won't face a real-world adversary.

Looking at the big picture and the context for this lecture, I think most of you are probably here because you've heard how incredibly powerful and successful machine learning is, that very many different tasks that could not be solved with software before are now solvable thanks to deep learning and convolutional networks and gradient descent, all of these technologies that are working really well. Until just a few years ago, these technologies didn't really work. In about 2013, we started to see that deep learning achieved human-level performance at a lot of different tasks. We saw that convolutional nets could recognize objects in images and score about the same as people on those benchmarks, with the caveat that part of the reason algorithms score as well as people is that people can't tell Alaskan Huskies from Siberian Huskies very well. But modulo the strangeness of the benchmarks, deep learning caught up to about human-level performance for object recognition in about 2013. That same year, we also saw that object recognition applied to human faces caught up to about human level.
Suddenly we had computers that could recognize faces about as well as you or I can recognize the faces of strangers. You can recognize the faces of your friends and family better than a computer, but when you're dealing with people that you haven't had a lot of experience with, the computer caught up to us in about 2013. We also saw that computers caught up to humans for reading typewritten text in photos in about 2013. It even got to the point that we could no longer use CAPTCHAs to tell whether a user of a webpage is human or not, because the convolutional network is better at reading obfuscated text than a human is.

So with this context today, of deep learning working really well, especially for computer vision, it's a little bit unusual to think about the computer making a mistake. Before about 2013, nobody was ever surprised if the computer made a mistake. That was the rule, not the exception. And so today's topic, which is all about unusual mistakes that deep learning algorithms make, wasn't really a serious avenue of study until the algorithms started to work well most of the time. Now people study the way that they break, now that that's actually the exception rather than the rule.

An adversarial example is an example that has been carefully computed to be misclassified. In a lot of cases we're able to make the new image indistinguishable to a human observer from the original image. Here, I show you one where we start with a panda. On the left, this is a panda that has not been modified in any way, and a convolutional network trained on the ImageNet dataset is able to recognize it as being a panda. One interesting thing is that the model doesn't have a whole lot of confidence in that decision. It assigns about 60% probability to this image being a panda.
If we then compute exactly the way that we could modify the image to cause the convolutional network to make a mistake, we find that the optimal direction to move all the pixels is given by this image in the middle. To a human it looks a lot like noise. It's not actually noise; it's carefully computed as a function of the parameters of the network. There's actually a lot of structure there. If we multiply that image of the structured attack by a very small coefficient and add it to the original panda, we get an image that a human can't tell from the original panda. In fact, on this slide there is no difference between the panda on the left and the panda on the right. When we present the image to the convolutional network, we use 32-bit floating point values. The monitor here can only display eight bits of color resolution, and we have made a change that's just barely too small to affect the smallest of those eight bits, but it affects the other 24 bits of the 32-bit floating point representation. And that little tiny change is enough to fool the convolutional network into recognizing this image of a panda as being a gibbon.

Another interesting thing is that it doesn't just change the class. It's not that we just barely found the decision boundary and just barely stepped across it. The convolutional network actually has much more confidence in its incorrect prediction, that the image on the right is a gibbon, than it had for the original being a panda. On the right, it believes that the image is a gibbon with 99.9% probability. So before, it thought that there was about a 1/3 chance that it was something other than a panda, and now it's about as certain as it can possibly be that it's a gibbon.
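The eight-bit point is easy to check numerically. Here is a minimal sketch, with a synthetic image and a made-up attack direction standing in for the real network's gradient:

```python
import numpy as np

# Start from an 8-bit image (as real photos are), viewed as float32 in [0, 1].
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float32) / 255.0

# Hypothetical attack direction; a real attack would derive it from the model.
direction = np.sign(rng.standard_normal(x.shape)).astype(np.float32)

eps = 0.4 / 255.0            # just under half of one 8-bit quantization step
x_adv = x + eps * direction  # the float32 tensor the network actually sees

to_uint8 = lambda img: np.round(np.clip(img, 0.0, 1.0) * 255.0).astype(np.uint8)
print(np.array_equal(to_uint8(x), to_uint8(x_adv)))  # True: identical on a display
print(float(np.abs(x_adv - x).max()))                # > 0: different to the network
```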
As a little bit of history, people have studied ways of computing attacks to fool different machine learning models since at least about 2004, and maybe earlier. For a long time this was done in the context of fooling spam detectors. In about 2013, Battista Biggio found that you could fool neural networks in this way, and around the same time my colleague, Christian Szegedy, found that you could make this kind of attack against deep neural networks just by using an optimization algorithm to search over the input image. A lot of what I'll be telling you about today is my own follow-up work on this topic; I've spent a lot of my career over the past few years understanding why these attacks are possible and why it's so easy to fool these convolutional networks.

When my colleague Christian first discovered this phenomenon, independently from Battista Biggio but around the same time, he found that it was actually a result of a visualization he was trying to make. He wasn't studying security. He wasn't studying how to fool a neural network. Instead, he had a convolutional network that could recognize objects very well, and he wanted to understand how it worked. So he thought that maybe he could take an image of a scene, for example a picture of a ship, and gradually transform that image into something that the network would recognize as being an airplane. Over the course of that transformation, he could see how the features of the input change. You might expect that maybe the background would turn blue to look like the sky behind an airplane, or you might expect that the ship would grow wings to look more like an airplane. You could conclude from that that the convolutional network uses the blue sky or uses the wings to recognize airplanes.

That's actually not really what happened at all. Each of these panels here shows an animation that you read left to right, top to bottom. Each panel is another step of gradient ascent on the log probability that the input is an airplane according to a convolutional net model, and then we follow the gradient on the input to the image.
You're probably used to following the gradient on the parameters of a model, but you can use the backpropagation algorithm to compute the gradient on the input image using exactly the same procedure that you would use to compute the gradient on the parameters.

In this animation of the ship in the upper left, we see five panels that all look basically the same. Gradient descent doesn't seem to have moved the image at all, but by the last panel the network is completely confident that this is an airplane. When you first code up this kind of experiment, especially if you don't know what's going to happen, it feels a little bit like you have a bug in your script and you're just displaying the same image over and over again. The first time I did it, I couldn't believe it was happening, and I had to open up the images in NumPy, take the difference of them, and make sure that there was actually a non-zero difference in there. But there is.

I show several different animations here of a ship, a car, a cat, and a truck. The only one where I actually see any change at all is the image of the cat. The color of the cat's face changes a little bit, and maybe it becomes a little bit more like the color of a metal airplane. Other than that, I don't see any changes in any of these animations, and I don't see anything very suggestive of an airplane. So gradient descent, rather than turning the input into an example of an airplane, has found an image that fools the network into thinking that the input is an airplane. And if we were malicious attackers, we didn't even have to work very hard to figure out how to fool the network. We just asked the network to give us an image of an airplane, and it gave us something that fools it into thinking that the input is an airplane.
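As a sketch of that procedure, here is what gradient ascent on the input might look like in PyTorch. The model, the starting image, and the target class index are stand-ins for the actual experiment:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Stand-in model and "ship" image; the original experiment used a different net.
model = models.resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)
target = 404  # hypothetical target index ("airliner" in common ImageNet orderings)

optimizer = torch.optim.SGD([x], lr=1e-2)
for step in range(100):
    optimizer.zero_grad()
    log_probs = F.log_softmax(model(x), dim=1)
    loss = -log_probs[0, target]   # ascend the log-probability of "airplane"
    loss.backward()                # same backprop, but the gradient lands on x
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0.0, 1.0)         # keep the image in a valid range
```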
When Christian first published this work, a lot of articles came out with titles like "The Flaw Lurking in Every Deep Neural Net" or "Deep Learning Has Deep Flaws." It's important to remember that these vulnerabilities apply to essentially every machine learning algorithm that we've studied so far. Some of them, like RBF networks and Parzen density estimators, are able to resist this effect somewhat, but even very simple machine learning algorithms are highly vulnerable to adversarial examples.

In this image, I show an animation of what happens when we attack a linear model, so it's not a deep algorithm at all. It's just a shallow softmax model. You multiply by a matrix, you add a vector of bias terms, you apply the softmax function, and you've got your probability distribution over the 10 MNIST classes.

At the upper left, I start with an image of a nine, and then as we move left to right, top to bottom, I gradually transform it to be a zero. Where I've drawn the yellow box, the model assigns high probability to it being a zero. I forget exactly what my threshold was for high probability, but I think it was around 0.9 or so. Then as we move to the second row, I transform it into a one, and the second yellow box indicates where we've successfully fooled the model into thinking it's a one with high probability. And then as you read the rest of the yellow boxes left to right, top to bottom, we go through the twos, threes, fours, and so on, until finally at the lower right we have a nine that has a yellow box around it, and it actually looks like a nine. But in this case, the only reason it actually looks like a nine is that we started the whole process with a nine. We successfully swept through all 10 classes of MNIST without substantially changing the image of the digit in any way that would interfere with human recognition. This linear model was actually extremely easy to fool.
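That shallow softmax model is only a few lines of NumPy. A minimal sketch, with random stand-in weights where a trained model would have fitted ones:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 10)) * 0.01  # one weight column per MNIST class
b = np.zeros(10)                           # vector of bias terms

def predict(x):
    """x: a flattened 28x28 image in [0, 1]; returns P(class | x)."""
    logits = x @ W + b
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

print(predict(rng.random(784)).round(3))   # a distribution over the 10 classes
```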
Besides that shallow softmax model, we've also seen that we can fool many other kinds of linear models, including logistic regression and SVMs. We've also found that we can fool decision trees and, to a lesser extent, nearest neighbor classifiers.

We wanted to explain exactly why this happens. Back in about 2014, after we'd published the original paper where we'd said that these problems exist, we were trying to figure out why they happen. When we wrote our first paper, we thought that basically this is a form of overfitting: you have a very complicated deep neural network, it learns to fit the training set, its behavior on the test set is somewhat undefined, and then it makes random mistakes that an attacker can exploit.

Let's walk through what that story looks like somewhat concretely. I have here a training set of three blue X's and three green O's. We want to make a classifier that can recognize X's and recognize O's. We have a very complicated classifier that can easily fit the training set, so we represent everywhere it believes X's should be with blobs of blue color, and I've drawn a blob of blue around all of the training set X's, so it correctly classifies the training set. It also has a blob of green mass showing where the O's are, and it successfully fits all of the green training set O's. But then, because this is a very complicated function and it has just way more parameters than it actually needs to represent the training task, it throws little blobs of probability mass around the rest of space randomly. On the left there's a blob of green space that's kind of near the training set X's, and I've drawn a red X there to show that maybe this would be an adversarial example, where we expect the classification to be X but the model assigns O. On the right, I've shown that there's a red O where we have another adversarial example. We're very near the other O's.
We might expect the model to assign this point the class O, and yet because it's drawn blue mass there, it's actually assigning it to be an X.

If overfitting is really the story, then each adversarial example is more or less the result of bad luck, and also more or less unique. If we fit the model again, or we fit a slightly different model, we would expect it to make different random mistakes on these points that are off the training set. But that was actually not what we found at all. We found that many different models would misclassify the same adversarial examples, and they would assign the same class to them. We also found that if we took the difference between an original example and an adversarial example, then we had a direction in input space, and we could add that same offset vector to any clean example, and we would almost always get an adversarial example as a result. So we started to realize that there was a systematic effect going on here, not just a random effect.

That led us to another idea, which is that adversarial examples might actually be more like underfitting rather than overfitting. They might actually come from the model being too linear. Here I draw the same task again, where we have the same manifold of O's and the same line of X's, and this time I fit a linear model to the data set rather than fitting a high capacity, non-linear model to it. We see that we get a dividing hyperplane running in between the two classes. This hyperplane doesn't really capture the true structure of the classes. The O's are clearly arranged in a C-shaped manifold. If we keep walking past the end of the O's, we cross the decision boundary, and I've drawn a red O where, even though we're very near the decision boundary and near other O's, the model believes that it is now an X. Similarly, we can take steps that go from near X's to just over the line, where they are classified as O's.
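To make that point concrete, here is a tiny sketch with scikit-learn: fit a logistic regression on toy 2D data shaped roughly like the figure, then walk far past the data and watch the model's confidence grow rather than shrink:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the X's and O's in the figure.
X = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 1.0],      # the "X" class
              [0.0, -1.0], [1.0, -1.5], [2.0, -1.0]])  # the "O" class
y = np.array([1, 1, 1, 0, 0, 0])
clf = LogisticRegression().fit(X, y)

# Confidence keeps rising with distance from the hyperplane, even in
# corners of the space where there was never any training data at all.
for point in [[1.0, 0.5], [1.0, 3.0], [30.0, 30.0]]:
    print(point, clf.predict_proba(np.array([point]))[0].round(4))
```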
Another thing that's somewhat unusual about this plot is that if we look at the lower left or upper right corners, these corners are very confidently classified, as X's on the lower left or O's on the upper right, even though we've never seen any data over there at all. The linear model family forces the model to have very high confidence in these regions that are very far from the decision boundary.

So we've seen that linear models can actually assign really unusual confidence as you move very far from the decision boundary, even if there isn't any data there. But are deep neural networks actually anything like linear models? Could linear models actually explain anything about how it is that deep neural nets fail? It turns out that modern deep neural nets are actually very piecewise linear. Rather than being a single linear function, they are piecewise linear, with maybe not that many linear pieces. If we use rectified linear units, then the mapping from the input image to the output logits is literally a piecewise linear function. By the logits I mean the un-normalized log probabilities before we apply the softmax op at the output of the model. There are other neural networks, like maxout networks, that are also literally piecewise linear. And then there are several that become very close to it. Before rectified linear units became popular, most people used to use sigmoid units of one form or another, either logistic sigmoid or hyperbolic tangent units. These sigmoidal units have to be carefully tuned, especially at initialization, so that you spend most of your time near the center of the sigmoid, where the sigmoid is approximately linear. Then finally the LSTM, a kind of recurrent network that is one of the most popular recurrent networks today, uses addition from one time step to the next in order to accumulate and remember information over time.
Addition is a particularly simple form of linearity, so we can see that the interaction between a very distant time step in the past and the present is highly linear within an LSTM.

Now to be clear, I'm speaking about the mapping from the input of the model to the output of the model. That's what I'm saying is close to being linear, or is piecewise linear with relatively few pieces. The mapping from the parameters of the network to the output of the network is non-linear, because the weight matrices at each layer of the network are multiplied together. So we actually get extremely non-linear interactions between the parameters and the output. That's what makes training a neural network so difficult. But the mapping from the input to the output is much more linear and predictable, and it means that optimization problems that aim to optimize the input to the model are much easier than optimization problems that aim to optimize the parameters.

If we go and look for this happening in practice, we can take a convolutional network and trace out a one-dimensional path through its input space. So what we're doing here is we're choosing a clean example. It's an image of a white car on a red background, and we are choosing a direction to travel through input space. We are going to have a coefficient epsilon that we multiply by this direction. When epsilon is negative 30, like at the left end of the plot, we're subtracting off a lot of this unit vector direction. When epsilon is zero, like in the middle of the plot, we're visiting the original image from the data set. And when epsilon is positive 30, like at the right end of the plot, we're adding this direction onto the input.
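Here is a sketch of how such a sweep can be computed. The model, image, and direction are stand-ins, and epsilon is interpreted on a [0, 255] pixel scale:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()    # stand-in ReLU network
x = torch.rand(1, 3, 224, 224)                  # stand-in "white car" image
direction = torch.sign(torch.randn_like(x))     # fixed direction to travel in

with torch.no_grad():
    epsilons = torch.linspace(-30.0, 30.0, steps=61) / 255.0
    traces = torch.stack([model(x + eps * direction).squeeze(0)
                          for eps in epsilons])  # one row of logits per epsilon
print(traces.shape)  # (61, num_classes): one logit curve per class, as on the slide
```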
In the panel on the left, I show you an animation where we move from epsilon equals negative 30 up to epsilon equals positive 30. You read the animation left to right, top to bottom, and everywhere that there's a yellow box, the input is correctly recognized as being a car. On the upper left, you see that it looks mostly blue. On the lower right, it's hard to tell what's going on; it's kind of reddish and so on. In the middle row, just after where the yellow boxes end, you can see pretty clearly that it's a car on a red background, though the image is small on these slides.

What's interesting to look at here is the logits that the model outputs. This is a deep convolutional rectified linear unit network. Because it uses rectified linear units, we know that the output is a piecewise linear function of the input to the model. The main question we're asking by making this plot is how many different pieces this piecewise linear function has if we look at one particular cross section. You might think that maybe a deep net is going to represent some extremely wiggly, complicated function with lots and lots of linear pieces, no matter which cross section you look in. Or we might find that it has more or less two pieces for each function we look at.

Each of the different curves on this plot is the logits for a different class. We see out at the tails of the plot that the frog class is the most likely, and the frog class basically looks like a big v-shaped function. The logits for the frog class become very high when epsilon is negative 30 or positive 30, and they drop down and become a little bit negative when epsilon is zero. The car class, listed as automobile here, is actually high in the middle, and the car is correctly recognized. As we sweep out to very negative epsilon, the logits for the car class do increase, but they don't increase nearly as quickly as the logits for the frog class.
So we've found a direction that's associated with the frog class, and as we follow it out to a relatively large perturbation, we find that the model extrapolates linearly and begins to make a very unreasonable prediction, that the frog class is extremely likely, just because we've moved for a long time in this direction that was locally associated with the frog class being more likely.

When we actually go and construct adversarial examples, we need to remember that we're able to get quite a large perturbation without changing the image very much as far as a human being is concerned. So here I show you a handwritten digit three, and I'm going to change it in several different ways, where all of these changes have the same L2 norm. In the top row, I'm going to change the three into a seven just by looking for the nearest seven in the training set. The difference between those two is this image that looks a little bit like the seven wrapped in some black lines. In the perturbation column in the middle, white pixels represent adding something and black pixels represent subtracting something as you move from the left column to the right column. So when we take the three and we apply this perturbation that transforms it into a seven, we can measure the L2 norm of that perturbation, and it turns out to have an L2 norm of 3.96. That gives you kind of a reference for how big these perturbations can be.

In the middle row, we apply a perturbation of exactly the same size, but with the direction chosen randomly. In this case we don't actually change the class of the three at all; we just get some random noise that didn't really change the class. A human could still easily read it as being a three. And then finally, at the very bottom row, we take the three and we just erase a piece of it with a perturbation of the same norm, and we turn it into something that doesn't have any class at all.
It's not a three, it's not a seven, it's just a defective input. All of these changes can happen with the same L2 norm perturbation. And actually, a lot of the time with adversarial examples, you make perturbations that have an even larger L2 norm. What's going on is that there are many different pixels in the image, and so small changes to individual pixels can add up to relatively large vectors. For larger datasets like ImageNet, where there are even more pixels, you can make very small changes to each pixel that travel very far in vector space as measured by the L2 norm. That means you can actually make changes that are almost imperceptible but move you really far and get a large dot product with the coefficients of the linear function that the model represents.

It also means that when we're constructing adversarial examples, we need to make sure that the adversarial example procedure isn't able to do what happened in the top row of this slide. In the top row, we took the three and we actually just changed it into a seven. So when the model says that the image in the upper right is a seven, it's not a mistake; we actually just changed the input class. When we build adversarial examples, we want to make sure that we're measuring real mistakes. If we're experimenters studying how easy a network is to fool, we want to make sure that we're actually fooling it and not just changing the input class. And if we're an attacker, we actually want to make sure that we're causing misbehavior in the system.

To do that, when we build adversarial examples, we use the max norm to constrain the perturbation. Basically this says that no pixel can change by more than some amount epsilon. So the L2 norm can get really big, but you can't concentrate all the changes for that L2 norm to erase pieces of the digit, like in the bottom row here, where we erased the top of the three.
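Here is a small numeric sketch of the contrast between the two norms, plus the projection step that enforces the max norm constraint; the image and epsilon are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((28, 28))                         # stand-in MNIST digit
delta = rng.uniform(-8 / 255, 8 / 255, x.shape)  # max-norm-bounded perturbation

print(float(np.abs(delta).max()))    # max norm: tiny at every single pixel
print(float(np.linalg.norm(delta)))  # L2 norm: much larger, since 784 pixels add up

def project_max_norm(x_adv, x, eps):
    # After any attack step, clip so that no pixel has changed by more than eps.
    return np.clip(x_adv, x - eps, x + eps)
```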
One very fast way to build an adversarial example is just to take the gradient of the cost that you used to train the network with respect to the input, and then take the sign of that gradient. The sign is essentially enforcing the max norm constraint: you're only allowed to change the input by up to epsilon at each pixel, so if you just take the sign, it tells you whether you want to add epsilon or subtract epsilon in order to hurt the network. You can view this as taking the observation that the network is more or less linear, as we showed on this slide, and using it to motivate building a first-order Taylor series approximation of the neural network's cost. Then, subject to that Taylor series approximation, we maximize the cost under the max norm constraint. And that gives us this technique that we call the fast gradient sign method.

If you want to just get your hands dirty and start making adversarial examples really quickly, or if you have an algorithm where you want to train on adversarial examples in the inner loop of learning, this method will make adversarial examples for you very, very quickly. In practice you should also use other methods, like Nicholas Carlini's attack based on multiple steps of the Adam optimizer, to make sure that you have a very strong attack that you bring out when you think you have a model that might be more powerful. A lot of the time, people find that they can defeat the fast gradient sign method and think that they've built a successful defense, but then when you bring out a more powerful method that takes longer to evaluate, they find that they can't overcome the more computationally expensive attack.

So I've told you that adversarial examples happen because the model is very linear, and then I told you that we could use this linearity assumption to build this attack, the fast gradient sign method.
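As a concrete sketch, the fast gradient sign method is only a few lines in a framework like PyTorch; the model, image, and label below are stand-ins:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def fgsm(model, x, y, eps):
    """One-step max-norm attack: x + eps * sign(grad_x of the training cost)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # the cost used to train the network
    loss.backward()                       # gradient with respect to the input
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

model = models.resnet18(weights=None).eval()
x, y = torch.rand(1, 3, 224, 224), torch.tensor([0])
x_adv = fgsm(model, x, y, eps=8.0 / 255)
print(float((x_adv - x).abs().max()))     # <= eps: the max norm constraint holds
```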
This method, when applied to a regular neural network that doesn't have any special defenses, will get over a 99% attack success rate. So that seems to confirm, somewhat, this hypothesis that adversarial examples come from the model being far too linear and extrapolating in linear fashions when it shouldn't. Well, we can actually go looking for some more evidence. My friend David Warde-Farley and I built these maps of the decision boundaries of neural networks, and we found that they are consistent with the linearity hypothesis.

So the FGSM is the attack method that I described on the previous slide, where we take the sign of the gradient. We'd like to build a map of a two-dimensional cross section of input space and show which classes are assigned to the data at each point. In the grid on the right, each different cell, each little square within the grid, is a map of a CIFAR-10 classifier's decision boundary, with each cell corresponding to a different CIFAR-10 test example. On the left I show you a little legend where you can understand what each cell means. The very center of each cell corresponds to the original example from the CIFAR-10 dataset, with no modification. As we move left to right in the cell, we're moving in the direction of the fast gradient sign method attack, so just the sign of the gradient. As we move up and down within the cell, we're moving in a random direction that's orthogonal to the fast gradient sign method direction. So we get to see a 2D cross section of CIFAR-10 decision space.

At each pixel within this map, we plot a color that tells us which class is assigned there. We use white pixels to indicate that the correct class was chosen, and then we use different colors to represent all of the other, incorrect classes. You can see that in nearly all of the grid cells on the right, roughly the left half of the image is white.
So roughly the left half of the image has been correctly classified. As we move to the right, we see that there is usually a different color on the right half, and the boundaries between these regions are approximately linear. What's going on here is that the fast gradient sign method has identified a direction where, if we get a large dot product with that direction, we can get an adversarial example. And from this we can see that adversarial examples live more or less in linear subspaces.

When we first discovered adversarial examples, we thought that they might live in little tiny pockets. In the first paper we actually speculated that maybe they're a little bit like the rational numbers, hiding out finely tiled among the real numbers, with nearly every real number being near a rational number. We thought that because we were able to find an adversarial example corresponding to every clean example that we loaded into the network. After doing this further analysis, we found that what's happening is that every real example is near one of these linear decision boundaries where you cross over into an adversarial subspace. And once you're in that adversarial subspace, all the other points nearby are also adversarial examples that will be misclassified. This has security implications, because it means you only need to get the direction right. You don't need to find an exact coordinate in space. You just need to find a direction that has a large dot product with the sign of the gradient, and once you move more or less approximately in that direction, you can fool the model.

We also made another cross section where, after using the left-right axis as the fast gradient sign method direction, we looked for a second direction that has a high dot product with the gradient, so we could make both axes adversarial. And in this case you see that we still get linear decision boundaries.
They're now oriented diagonally rather than vertically, but you can see that there's actually this two-dimensional subspace of adversarial examples that we can cross into.

Finally, it's important to remember that adversarial examples are not noise. You can add a lot of noise to an adversarial example and it will stay adversarial. You can add a lot of noise to a clean example and it will stay clean. Here we make random cross sections where both axes are randomly chosen directions. And you see that on CIFAR-10, most of the cells are completely white, meaning that they're correctly classified to start with, and when you add noise they stay correctly classified. We also see that the model makes some mistakes, because this is the test set. And generally, if a test example starts out misclassified, adding the noise doesn't change it. There are a few exceptions: if you look in the third row, third column, noise actually can make the model misclassify the example for especially large noise values. And there's even one example you can see in the top row where the model misclassifies the test example to start with, but then noise can change it to be correctly classified. For the most part though, noise has very little effect on the classification decision compared to adversarial examples.

What's going on here is that in high dimensional spaces, if you choose some reference vector and then you choose a random vector in that high dimensional space, the random vector will, on average, have zero dot product with the reference vector. So if you make a first-order Taylor series approximation of your cost and ask how that approximation predicts random vectors will change your cost, you see that random vectors on average have no effect on the cost, but adversarial examples are chosen to maximize it. In these plots we looked in two dimensions.
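Here is a sketch of how one cell of these 2D maps can be produced, assuming a stand-in model and image; one axis is the FGSM direction and the other a random direction made orthogonal to it:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in test example
y = torch.tensor([0])

F.cross_entropy(model(x), y).backward()
adv_dir = x.grad.sign()                          # the FGSM axis
rand = torch.randn_like(adv_dir)                 # random second axis...
rand -= (rand * adv_dir).sum() / (adv_dir * adv_dir).sum() * adv_dir
rand_dir = rand / rand.abs().max()               # ...kept orthogonal to the first

coords = torch.linspace(-30.0, 30.0, steps=15) / 255.0
with torch.no_grad():
    cell = [[model(x + a * adv_dir + b * rand_dir).argmax().item()
             for a in coords] for b in coords]   # one predicted class per pixel
```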
760 00:33:13,505 --> 00:33:16,260 More recently, Florian Tramer here at Stanford 761 00:33:16,260 --> 00:33:17,720 got interested in finding out 762 00:33:17,720 --> 00:33:20,702 just how many dimensions there are to these subspaces 763 00:33:20,702 --> 00:33:22,702 where the adversarial examples 764 00:33:22,702 --> 00:33:25,908 lie in a thick contiguous region. 765 00:33:25,908 --> 00:33:28,716 And we came up with an algorithm together 766 00:33:28,716 --> 00:33:30,513 where you actually look for 767 00:33:30,513 --> 00:33:32,259 several different orthogonal vectors 768 00:33:32,259 --> 00:33:35,878 that all have a large dot product with the gradient. 769 00:33:35,878 --> 00:33:38,019 By looking in several different 770 00:33:38,019 --> 00:33:40,256 orthogonal directions simultaneously, 771 00:33:40,256 --> 00:33:42,684 we can map out this kind of polytope 772 00:33:42,684 --> 00:33:45,833 where many different adversarial examples live. 773 00:33:45,833 --> 00:33:47,974 We found out that this adversarial region 774 00:33:47,974 --> 00:33:51,592 has on average about 25 dimensions. 775 00:33:51,592 --> 00:33:53,389 If you look at different examples you'll find 776 00:33:53,389 --> 00:33:56,043 different numbers of adversarial dimensions. 777 00:33:56,043 --> 00:33:59,526 But on average on MNIST we found it was about 25. 778 00:33:59,526 --> 00:34:02,181 So what's interesting here is the dimensionality 779 00:34:02,181 --> 00:34:04,137 actually tells you something about 780 00:34:04,137 --> 00:34:06,782 how likely you are to find an adversarial example 781 00:34:06,782 --> 00:34:09,350 by generating random noise. 782 00:34:09,350 --> 00:34:12,288 If every direction were adversarial, 783 00:34:12,288 --> 00:34:15,657 then any change would cause a misclassification. 784 00:34:15,657 --> 00:34:17,692 If most of the directions were adversarial, 785 00:34:17,692 --> 00:34:20,443 then random directions would end up being adversarial 786 00:34:20,443 --> 00:34:22,731 just by accident most of the time. 787 00:34:22,731 --> 00:34:25,879 And then if there was only one adversarial direction, 788 00:34:25,879 --> 00:34:28,237 you'd almost never find that direction 789 00:34:28,237 --> 00:34:30,219 just by adding random noise. 790 00:34:30,219 --> 00:34:34,088 When there's 25 you have a chance of doing it sometimes. 791 00:34:34,089 --> 00:34:36,321 Another interesting thing is that different models 792 00:34:36,321 --> 00:34:39,724 will often misclassify the same adversarial examples. 793 00:34:39,724 --> 00:34:43,592 The subspace dimensionality of the adversarial subspace 794 00:34:43,592 --> 00:34:46,275 relates to that transfer property. 795 00:34:46,275 --> 00:34:48,992 The larger the dimensionality of the subspace, 796 00:34:48,993 --> 00:34:50,505 the more likely it is that the subspaces 797 00:34:50,505 --> 00:34:52,929 for two models will intersect. 798 00:34:52,929 --> 00:34:55,237 So if you have two different models 799 00:34:55,237 --> 00:34:57,220 that have a very large adversarial subspace, 800 00:34:57,220 --> 00:34:58,742 you know that you can probably transfer 801 00:34:58,742 --> 00:35:01,161 adversarial examples from one to the other. 
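As a toy illustration of what a multi-dimensional adversarial subspace means (this block construction is a simplification for exposition, not the actual algorithm from that work), you can write down k mutually orthogonal max-norm perturbations that all push the cost uphill to first order:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, eps = 1000, 25, 0.1                 # input dim, directions sought, step size
g = rng.normal(size=n)                    # stand-in for the cost gradient

# Give each direction its own block of coordinates and follow sign(g) there:
# disjoint supports make the directions orthogonal, and each one still has a
# positive dot product with g, so each increases the cost to first order.
dirs = np.zeros((k, n))
for i, idx in enumerate(np.array_split(np.arange(n), k)):
    dirs[i, idx] = eps * np.sign(g[idx])

gram = dirs @ dirs.T
print("orthogonal:", np.allclose(gram, np.diag(np.diag(gram))))  # True
print("all uphill:", (dirs @ g > 0).all())                       # True
```

Run against a real model rather than a stand-in gradient, a search of this flavor, checking how many orthogonal directions remain genuinely adversarial, is what yields the estimate of roughly 25 dimensions on MNIST.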
802 00:35:01,161 --> 00:35:03,609 But if the adversarial subspace is very small, 803 00:35:03,609 --> 00:35:06,796 then unless there's some kind of really systematic effect 804 00:35:06,796 --> 00:35:09,603 forcing them to share exactly the same subspace, 805 00:35:09,603 --> 00:35:11,548 it seems less likely that you'll be able to transfer 806 00:35:11,548 --> 00:35:15,715 examples just due to the subspaces randomly aligning. 807 00:35:17,716 --> 00:35:20,563 A lot of the time in the adversarial example 808 00:35:20,563 --> 00:35:21,786 research community, 809 00:35:21,786 --> 00:35:25,080 we refer back to the story of Clever Hans. 810 00:35:25,080 --> 00:35:28,176 This comes from an essay by Bob Sturm called 811 00:35:28,176 --> 00:35:30,408 Clever Hans, Clever Algorithms. 812 00:35:30,408 --> 00:35:32,764 Because Clever Hans is a pretty good metaphor 813 00:35:32,764 --> 00:35:35,679 for what's happening with machine learning algorithms. 814 00:35:35,679 --> 00:35:39,446 So Clever Hans was a horse that lived in the early 1900s. 815 00:35:39,446 --> 00:35:43,171 His owner trained him to do arithmetic problems. 816 00:35:43,171 --> 00:35:45,494 So you could ask him, "Clever Hans, 817 00:35:45,494 --> 00:35:47,092 "what's two plus one?" 818 00:35:47,092 --> 00:35:50,425 And he would answer by tapping his hoof. 819 00:35:52,566 --> 00:35:54,873 And after the third tap, everybody would start 820 00:35:54,873 --> 00:35:56,976 cheering and clapping and looking excited 821 00:35:56,976 --> 00:35:59,958 because he'd actually done an arithmetic problem. 822 00:35:59,958 --> 00:36:01,151 Well it turned out that 823 00:36:01,151 --> 00:36:03,254 he hadn't actually learned to do arithmetic. 824 00:36:03,254 --> 00:36:05,256 But it was actually pretty hard to figure out 825 00:36:05,256 --> 00:36:06,638 what was going on. 826 00:36:06,638 --> 00:36:10,924 His owner was not trying to defraud anybody, 827 00:36:10,924 --> 00:36:13,588 his owner actually believed he could do arithmetic. 828 00:36:13,588 --> 00:36:15,782 And presumably Clever Hans himself 829 00:36:15,782 --> 00:36:18,067 was not trying to trick anybody. 830 00:36:18,067 --> 00:36:20,390 But eventually a psychologist examined him 831 00:36:20,390 --> 00:36:23,832 and found that if he was put in a room alone 832 00:36:23,832 --> 00:36:25,358 without an audience, 833 00:36:25,358 --> 00:36:29,137 and the person asking the questions wore a mask, 834 00:36:29,137 --> 00:36:31,156 he couldn't figure out when to stop tapping. 835 00:36:31,156 --> 00:36:32,505 You'd ask him, "Clever Hans, 836 00:36:32,505 --> 00:36:33,994 "what's one plus one?" 837 00:36:33,994 --> 00:36:37,411 And he'd just [knocking] 838 00:36:38,642 --> 00:36:40,084 keep staring at your face, waiting for you 839 00:36:40,084 --> 00:36:42,710 to give him some sign that he was done tapping. 840 00:36:42,710 --> 00:36:44,784 So everybody in this situation 841 00:36:44,784 --> 00:36:46,975 was trying to do the right thing. 842 00:36:46,975 --> 00:36:48,776 Clever Hans was trying to do whatever it took 843 00:36:48,776 --> 00:36:51,478 to get the apple that his owner would give him 844 00:36:51,478 --> 00:36:53,275 when he answered an arithmetic problem. 845 00:36:53,275 --> 00:36:56,155 His owner did his best to train him correctly 846 00:36:56,155 --> 00:36:57,861 with real arithmetic questions 847 00:36:57,861 --> 00:37:00,957 and real rewards for correct answers. 
848 00:37:00,957 --> 00:37:03,787 And what happened was that Clever Hans 849 00:37:03,787 --> 00:37:07,118 inadvertently focused on the wrong cue. 850 00:37:07,118 --> 00:37:09,801 He found this cue of people's social reactions 851 00:37:09,801 --> 00:37:12,912 that could reliably help him solve the problem, 852 00:37:12,912 --> 00:37:15,231 but then it didn't generalize to a test set 853 00:37:15,231 --> 00:37:18,060 where you intentionally took that cue away. 854 00:37:18,060 --> 00:37:21,177 It did generalize to a naturally occurring test set, 855 00:37:21,177 --> 00:37:22,958 where he had an audience. 856 00:37:22,958 --> 00:37:24,633 So that's more or less what's happening 857 00:37:24,633 --> 00:37:26,289 with machine learning algorithms. 858 00:37:26,289 --> 00:37:28,305 They've found these very linear patterns 859 00:37:28,305 --> 00:37:30,590 that can fit the training data, 860 00:37:30,590 --> 00:37:34,384 and these linear patterns even generalize to the test data. 861 00:37:34,384 --> 00:37:36,907 They've learned to handle any example that comes from 862 00:37:36,907 --> 00:37:40,415 the same distribution as their training data. 863 00:37:40,415 --> 00:37:42,163 But then if you shift the distribution 864 00:37:42,163 --> 00:37:43,603 that you test them on, 865 00:37:43,603 --> 00:37:46,934 if a malicious adversary actually creates examples 866 00:37:46,934 --> 00:37:48,570 that are intended to fool them, 867 00:37:48,570 --> 00:37:50,820 they're very easily fooled. 868 00:37:51,686 --> 00:37:54,316 In fact we find that modern machine learning algorithms 869 00:37:54,316 --> 00:37:56,726 are wrong almost everywhere. 870 00:37:56,726 --> 00:37:59,606 We tend to think of them as being correct most of the time, 871 00:37:59,606 --> 00:38:02,073 because when we run them on naturally occurring inputs 872 00:38:02,073 --> 00:38:06,048 they achieve very high accuracy percentages. 873 00:38:06,048 --> 00:38:08,440 But if we look, instead of at the percentage 874 00:38:08,440 --> 00:38:11,107 of samples from an IID test set, 875 00:38:12,007 --> 00:38:15,628 at the percentage of the space in R^n 876 00:38:15,628 --> 00:38:17,655 that is correctly classified, 877 00:38:17,655 --> 00:38:20,649 we find that they misclassify almost everything 878 00:38:20,649 --> 00:38:24,158 and they behave reasonably only on a very thin manifold 879 00:38:24,158 --> 00:38:27,489 surrounding the data that we train them on. 880 00:38:27,489 --> 00:38:30,187 In this plot, I show you several different examples 881 00:38:30,187 --> 00:38:32,006 of Gaussian noise 882 00:38:32,006 --> 00:38:35,075 that I've run through a CIFAR-10 classifier. 883 00:38:35,075 --> 00:38:37,100 Everywhere that there is a pink box, 884 00:38:37,100 --> 00:38:39,213 the classifier thinks that there is something 885 00:38:39,213 --> 00:38:40,780 rather than nothing. 886 00:38:40,780 --> 00:38:43,030 I'll come back to what that means in a second. 887 00:38:43,030 --> 00:38:45,227 Everywhere that there is a yellow box, 888 00:38:45,227 --> 00:38:47,622 one step of the fast gradient sign method 889 00:38:47,622 --> 00:38:50,132 was able to persuade the model that it was looking 890 00:38:50,132 --> 00:38:52,395 specifically at an airplane. 891 00:38:52,395 --> 00:38:53,731 I chose the airplane class 892 00:38:53,731 --> 00:38:56,254 because it was the one with the lowest success rate. 893 00:38:56,254 --> 00:38:58,671 It had about a 25% success rate.
894 00:38:58,671 --> 00:39:01,898 That means an attacker would need four chances 895 00:39:01,898 --> 00:39:06,291 to get noise recognized as an airplane on this model. 896 00:39:06,291 --> 00:39:08,494 An interesting thing, and appropriate enough 897 00:39:08,494 --> 00:39:09,994 given the story of Clever Hans, 898 00:39:09,994 --> 00:39:12,903 is that this model found that about 70% of R^n 899 00:39:12,903 --> 00:39:15,070 was classified as a horse. 900 00:39:17,510 --> 00:39:20,194 So I mentioned that this model will say 901 00:39:20,194 --> 00:39:22,606 that noise is something rather than nothing. 902 00:39:22,606 --> 00:39:24,450 And it's actually kind of important to think about 903 00:39:24,450 --> 00:39:26,401 how we evaluate that. 904 00:39:26,401 --> 00:39:28,498 If you have a softmax classifier, 905 00:39:28,498 --> 00:39:30,529 it has to give you a distribution 906 00:39:30,529 --> 00:39:34,158 over the n different classes that you train it on. 907 00:39:34,158 --> 00:39:35,825 So there's a few ways that you can argue 908 00:39:35,825 --> 00:39:37,119 that the model is telling you 909 00:39:37,119 --> 00:39:39,138 that there's something rather than nothing. 910 00:39:39,138 --> 00:39:42,026 One is you can say, if it assigns something like 90% 911 00:39:42,026 --> 00:39:43,698 to one particular class, 912 00:39:43,698 --> 00:39:46,373 that seems to be voting for that class being there. 913 00:39:46,373 --> 00:39:47,705 We'd much rather see it give us 914 00:39:47,705 --> 00:39:50,018 something like a uniform distribution saying 915 00:39:50,018 --> 00:39:52,833 this noise doesn't look like anything in the training set 916 00:39:52,833 --> 00:39:56,177 so it's equally likely to be a horse or a car. 917 00:39:56,177 --> 00:39:58,075 And that's not what the model does. 918 00:39:58,075 --> 00:40:01,028 It'll say, this is very definitely a horse. 919 00:40:01,028 --> 00:40:03,395 Another thing that you can do is you can replace 920 00:40:03,395 --> 00:40:05,186 the last layer of the model. 921 00:40:05,186 --> 00:40:10,009 For example, you can use a sigmoid output for each class. 922 00:40:10,009 --> 00:40:11,754 And then the model is actually capable of telling you 923 00:40:11,754 --> 00:40:14,407 that any subset of classes is present. 924 00:40:14,407 --> 00:40:15,777 It could actually tell you that an image 925 00:40:15,777 --> 00:40:17,250 is both a horse and a car. 926 00:40:17,250 --> 00:40:19,292 And what we would like it to do for the noise 927 00:40:19,292 --> 00:40:21,962 is tell us that none of the classes is present, 928 00:40:21,962 --> 00:40:23,585 that all of the sigmoids should have a value 929 00:40:23,585 --> 00:40:25,346 of less than 1/2. 930 00:40:25,346 --> 00:40:29,479 And 1/2 isn't even a particularly low threshold. 931 00:40:29,479 --> 00:40:32,034 We could reasonably expect that all of the sigmoids would be 932 00:40:32,034 --> 00:40:35,982 less than 0.01 for such a defective input as this. 933 00:40:35,982 --> 00:40:38,226 But what we find instead is that the sigmoids 934 00:40:38,226 --> 00:40:40,177 tend to report at least one class as present 935 00:40:40,177 --> 00:40:42,122 just from running Gaussian noise 936 00:40:42,122 --> 00:40:45,205 of sufficient norm through the model. 937 00:40:48,050 --> 00:40:50,269 We've also found that we can do adversarial examples 938 00:40:50,269 --> 00:40:51,946 for reinforcement learning. 939 00:40:51,946 --> 00:40:53,329 And there's a video for this.
940 00:40:53,329 --> 00:40:54,946 I'll upload the slides after the talk 941 00:40:54,946 --> 00:40:56,202 and you can follow the link. 942 00:40:56,202 --> 00:40:58,082 Unfortunately I wasn't able to get the WiFi to work 943 00:40:58,082 --> 00:41:00,245 so I can't show you the video animated. 944 00:41:00,245 --> 00:41:01,482 But I can describe basically what's going on 945 00:41:01,482 --> 00:41:03,232 from this still here. 946 00:41:05,258 --> 00:41:08,149 There's a game, Seaquest, on Atari 947 00:41:08,149 --> 00:41:09,897 where you can train reinforcement learning agents 948 00:41:09,897 --> 00:41:11,110 to play that game. 949 00:41:11,110 --> 00:41:14,270 And you can take the raw input pixels 950 00:41:14,270 --> 00:41:18,242 and you can take the fast gradient sign method 951 00:41:18,242 --> 00:41:21,642 or other attacks that use other norms besides the max norm, 952 00:41:21,642 --> 00:41:24,586 and compute perturbations that are intended 953 00:41:24,586 --> 00:41:27,646 to change the action that the policy would select. 954 00:41:27,646 --> 00:41:29,566 So the reinforcement learning policy, 955 00:41:29,566 --> 00:41:31,350 you can think of it as just being like a classifier 956 00:41:31,350 --> 00:41:33,211 that looks at a frame. 957 00:41:33,211 --> 00:41:35,550 And instead of categorizing the input 958 00:41:35,550 --> 00:41:37,126 into a particular category, 959 00:41:37,126 --> 00:41:40,753 it gives you a softmax distribution over actions to take. 960 00:41:40,753 --> 00:41:43,427 So if we just take that and say 961 00:41:43,427 --> 00:41:47,482 that the most likely action should have 962 00:41:47,482 --> 00:41:49,261 its probability be decreased 963 00:41:49,261 --> 00:41:51,034 by the adversary, 964 00:41:51,034 --> 00:41:53,030 you'll get these perturbations of input frames 965 00:41:53,030 --> 00:41:55,762 that you can then apply and cause the agent 966 00:41:55,762 --> 00:41:58,670 to play different actions than it would have otherwise. 967 00:41:58,670 --> 00:42:00,268 And using this you can make the agent 968 00:42:00,268 --> 00:42:02,851 play Seaquest very, very badly. 969 00:42:03,786 --> 00:42:06,179 It's maybe not the most interesting possible thing. 970 00:42:06,179 --> 00:42:07,767 What we'd really like is an environment 971 00:42:07,767 --> 00:42:09,993 where there are many different reward functions available 972 00:42:09,993 --> 00:42:11,238 for us to study. 973 00:42:11,238 --> 00:42:14,071 So for example, if you had a robot 974 00:42:15,092 --> 00:42:17,579 that was intended to cook scrambled eggs, 975 00:42:17,579 --> 00:42:18,865 and you had a reward function measuring 976 00:42:18,865 --> 00:42:20,610 how well it's cooking scrambled eggs, 977 00:42:20,610 --> 00:42:22,397 and you had another reward function 978 00:42:22,397 --> 00:42:25,649 measuring how well it's cooking chocolate cake, 979 00:42:25,649 --> 00:42:27,849 it would be really interesting if we could make 980 00:42:27,849 --> 00:42:29,925 adversarial examples that cause the robot 981 00:42:29,925 --> 00:42:31,501 to make a chocolate cake 982 00:42:31,501 --> 00:42:35,017 when the user intended for it to make scrambled eggs. 983 00:42:35,017 --> 00:42:37,581 That's a harder goal, because it's very difficult to succeed at something 984 00:42:37,581 --> 00:42:40,393 and it's relatively straightforward to make a system fail.
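Here is a minimal sketch of that attack on a policy, treating it as a classifier over actions; the linear softmax "policy" is a hypothetical stand-in so the example stays self-contained, not a trained agent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_actions = 84 * 84, 18          # Atari-like frame and action count
W = rng.normal(scale=0.01, size=(n_actions, n_pixels))  # toy linear "policy"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

frame = rng.uniform(size=n_pixels)
probs = softmax(W @ frame)
a = int(probs.argmax())                    # action the agent would have taken

# Increase the cost -log p(a | frame); for this linear-softmax policy the
# gradient of that cost with respect to the input is (probs - onehot(a)) @ W.
onehot = np.zeros(n_actions)
onehot[a] = 1.0
grad = (probs - onehot) @ W

eps = 0.01                                 # max-norm budget for the perturbation
adv_frame = np.clip(frame + eps * np.sign(grad), 0.0, 1.0)

print("action before:", a, "after:", int(softmax(W @ adv_frame).argmax()))
```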
985 00:42:40,393 --> 00:42:42,400 So right now, adversarial examples for RL 986 00:42:42,400 --> 00:42:45,049 are very good at showing that we can make RL agents fail. 987 00:42:45,049 --> 00:42:47,827 But we haven't yet been able to hijack them 988 00:42:47,827 --> 00:42:49,229 and make them do a complicated task 989 00:42:49,229 --> 00:42:51,429 that's different from what their owner intended. 990 00:42:51,429 --> 00:42:53,405 Seems like it's one of the next steps 991 00:42:53,405 --> 00:42:56,655 in adversarial example research though. 992 00:42:58,101 --> 00:43:01,078 If we look at high-dimensional linear models, 993 00:43:01,078 --> 00:43:02,479 we can actually see that a lot of this 994 00:43:02,479 --> 00:43:04,682 is very simple and straightforward. 995 00:43:04,682 --> 00:43:07,585 Here we have a logistic regression model 996 00:43:07,585 --> 00:43:10,385 that classifies sevens and threes. 997 00:43:10,385 --> 00:43:13,665 So the whole model can be described just by a weight vector 998 00:43:13,665 --> 00:43:16,807 and a single scalar bias term. 999 00:43:16,807 --> 00:43:20,404 We don't really need to see the bias term for this exercise. 1000 00:43:20,404 --> 00:43:22,063 If you look on the left I've plotted the weights 1001 00:43:22,063 --> 00:43:24,929 that we used to discriminate sevens and threes. 1002 00:43:24,929 --> 00:43:27,505 The weights should look a little bit like the difference 1003 00:43:27,505 --> 00:43:30,098 between the average seven and the average three. 1004 00:43:30,098 --> 00:43:31,505 And then down at the bottom we've taken 1005 00:43:31,505 --> 00:43:33,225 the sign of the weights. 1006 00:43:33,225 --> 00:43:35,764 So the gradient for a logistic regression model 1007 00:43:35,764 --> 00:43:38,529 is going to be proportional to the weights. 1008 00:43:38,529 --> 00:43:41,505 And then the sign of the weights gives you 1009 00:43:41,505 --> 00:43:43,981 essentially the sign of the gradient. 1010 00:43:43,981 --> 00:43:46,268 So we can do the fast gradient sign method 1011 00:43:46,268 --> 00:43:49,955 to attack this model just by looking at its weights. 1012 00:43:49,955 --> 00:43:52,619 In the panel 1013 00:43:52,619 --> 00:43:54,327 that's the second column from the left, 1014 00:43:54,327 --> 00:43:55,981 we can see clean examples. 1015 00:43:55,981 --> 00:43:58,302 And then on the right we've just added or subtracted 1016 00:43:58,302 --> 00:44:00,900 this image of the sign of the weights from them. 1017 00:44:00,900 --> 00:44:03,515 To you and me as human observers, 1018 00:44:03,515 --> 00:44:06,871 the sign of the weights is just like garbage 1019 00:44:06,871 --> 00:44:08,204 that's in the background, 1020 00:44:08,204 --> 00:44:09,743 and we more or less filter it out. 1021 00:44:09,743 --> 00:44:11,868 It doesn't look particularly interesting to us. 1022 00:44:11,868 --> 00:44:14,364 It doesn't grab our attention. 1023 00:44:14,364 --> 00:44:16,001 To the logistic regression model 1024 00:44:16,001 --> 00:44:17,607 this image of the sign of the weights 1025 00:44:17,607 --> 00:44:20,449 is the most salient thing 1026 00:44:20,449 --> 00:44:22,791 that could ever appear in the image. 1027 00:44:22,791 --> 00:44:24,567 When it's positive it looks like 1028 00:44:24,567 --> 00:44:26,748 the world's most quintessential seven. 1029 00:44:26,748 --> 00:44:27,959 When it's negative it looks like 1030 00:44:27,959 --> 00:44:29,684 the world's most quintessential three.
1031 00:44:29,684 --> 00:44:31,127 And so the model makes its decision 1032 00:44:31,127 --> 00:44:33,242 almost entirely based on this perturbation 1033 00:44:33,242 --> 00:44:37,409 we added to the image, rather than on the background. 1034 00:44:38,498 --> 00:44:40,007 You could also take this same procedure, 1035 00:44:40,007 --> 00:44:44,174 and my colleague Andrej at OpenAI showed how you can 1036 00:44:45,271 --> 00:44:49,063 modify the image on ImageNet using this same approach, 1037 00:44:49,063 --> 00:44:51,706 and turn this goldfish into a daisy. 1038 00:44:51,706 --> 00:44:53,831 Because ImageNet is much higher dimensional, 1039 00:44:53,831 --> 00:44:56,769 you don't need to use quite as large of a coefficient 1040 00:44:56,769 --> 00:44:58,761 on the image of the weights. 1041 00:44:58,761 --> 00:45:03,226 So we can make a more persuasive fooling attack. 1042 00:45:03,226 --> 00:45:05,249 You can see that this same image of the weights, 1043 00:45:05,249 --> 00:45:08,631 when applied to any different input image, 1044 00:45:08,631 --> 00:45:12,231 will actually reliably cause a misclassification. 1045 00:45:12,231 --> 00:45:14,951 What's going on is that there are many different classes, 1046 00:45:14,951 --> 00:45:18,822 and it means that if you choose the weights 1047 00:45:18,822 --> 00:45:20,504 for any particular class, 1048 00:45:20,504 --> 00:45:23,364 it's very unlikely that a new test image 1049 00:45:23,364 --> 00:45:25,642 will belong to that class. 1050 00:45:25,642 --> 00:45:27,349 So on ImageNet, if we're using 1051 00:45:27,349 --> 00:45:29,351 the weights for the daisy class, 1052 00:45:29,351 --> 00:45:31,431 and there are 1,000 different classes, 1053 00:45:31,431 --> 00:45:33,628 then we have about a 99.9% chance 1054 00:45:33,628 --> 00:45:36,122 that a test image will not be a daisy. 1055 00:45:36,122 --> 00:45:37,767 If we then go ahead and add the weights 1056 00:45:37,767 --> 00:45:39,809 for the daisy class to that image, 1057 00:45:39,809 --> 00:45:41,889 then we get a daisy, and because that's not 1058 00:45:41,889 --> 00:45:45,207 the correct class, it's a misclassification. 1059 00:45:45,207 --> 00:45:47,068 So there's a paper at CVPR this year 1060 00:45:47,068 --> 00:45:48,748 called Universal Adversarial Perturbations 1061 00:45:48,748 --> 00:45:51,287 that expands a lot more on this observation 1062 00:45:51,287 --> 00:45:53,799 that we had going back in 2014. 1063 00:45:53,799 --> 00:45:56,647 But basically these weight vectors, 1064 00:45:56,647 --> 00:45:59,031 when applied to many different images, 1065 00:45:59,031 --> 00:46:02,614 can cause misclassification in all of them. 1066 00:46:04,647 --> 00:46:06,303 I've spent a lot of time telling you 1067 00:46:06,303 --> 00:46:08,508 that these linear models are just terrible, 1068 00:46:08,508 --> 00:46:11,269 and at some point you've probably been hoping 1069 00:46:11,269 --> 00:46:13,089 I would give you some sort of a control experiment 1070 00:46:13,089 --> 00:46:15,468 to convince you that there's another model 1071 00:46:15,468 --> 00:46:16,988 that's not terrible. 1072 00:46:16,988 --> 00:46:19,351 So it turns out that some quadratic models 1073 00:46:19,351 --> 00:46:21,249 actually perform really well. 1074 00:46:21,249 --> 00:46:23,927 In particular a shallow RBF network 1075 00:46:23,927 --> 00:46:27,687 is able to resist adversarial perturbations very well. 
1076 00:46:27,687 --> 00:46:29,047 Earlier I showed you an animation 1077 00:46:29,047 --> 00:46:30,522 where I took a nine and I turned it into 1078 00:46:30,522 --> 00:46:32,108 a zero, one, two, and so on, 1079 00:46:32,108 --> 00:46:34,884 without really changing its appearance at all. 1080 00:46:34,884 --> 00:46:36,028 And I was able to fool 1081 00:46:36,028 --> 00:46:39,329 a linear softmax regression classifier. 1082 00:46:39,329 --> 00:46:40,947 Here I've got an RBF network 1083 00:46:40,947 --> 00:46:43,384 where it outputs a separate probability 1084 00:46:43,384 --> 00:46:45,388 of each class being absent or present, 1085 00:46:45,388 --> 00:46:49,555 and that probability is given by exp(-||x - t||^2), 1086 00:46:51,111 --> 00:46:53,271 the exponential of the negative squared difference 1087 00:46:53,271 --> 00:46:55,489 between a template image t and the input image x. 1088 00:46:55,489 --> 00:46:59,108 And if we actually follow the gradient of this classifier, 1089 00:46:59,108 --> 00:47:01,903 it does actually turn the image into 1090 00:47:01,903 --> 00:47:04,801 a zero, a one, a two, a three, and so on, 1091 00:47:04,801 --> 00:47:07,249 and we can actually recognize those changes. 1092 00:47:07,249 --> 00:47:09,649 The problem is, this classifier does not get 1093 00:47:09,649 --> 00:47:12,164 very good accuracy on the training set. 1094 00:47:12,164 --> 00:47:13,767 It's a shallow model. 1095 00:47:13,767 --> 00:47:15,503 It's basically just a template matcher. 1096 00:47:15,503 --> 00:47:17,511 It is literally a template matcher. 1097 00:47:17,511 --> 00:47:20,689 And if you try to make it more sophisticated 1098 00:47:20,689 --> 00:47:22,049 by making it deeper, 1099 00:47:22,049 --> 00:47:26,216 it turns out that the gradient of these RBF units is zero, 1100 00:47:27,648 --> 00:47:30,762 or very near zero, throughout most of R^n. 1101 00:47:30,762 --> 00:47:32,769 So they're extremely difficult to train, 1102 00:47:32,769 --> 00:47:36,289 even with batch normalization and methods like that. 1103 00:47:36,289 --> 00:47:39,727 I haven't managed to train a deep RBF network yet. 1104 00:47:39,727 --> 00:47:42,748 But I think if somebody comes up with better hyperparameters 1105 00:47:42,748 --> 00:47:46,102 or a new, more powerful optimization algorithm, 1106 00:47:46,102 --> 00:47:47,489 it might be possible to solve 1107 00:47:47,489 --> 00:47:49,344 the adversarial example problem 1108 00:47:49,344 --> 00:47:51,489 by training a deep RBF network 1109 00:47:51,489 --> 00:47:55,985 where the model is so nonlinear and has such wide flat areas 1110 00:47:55,985 --> 00:47:59,409 that the adversary is not able to push the cost uphill 1111 00:47:59,409 --> 00:48:03,576 just by making small changes to the model's input. 1112 00:48:05,242 --> 00:48:06,887 One of the things that's the most alarming 1113 00:48:06,887 --> 00:48:08,209 about adversarial examples 1114 00:48:08,209 --> 00:48:11,649 is that they generalize from one dataset to another 1115 00:48:11,649 --> 00:48:13,468 and one model to another. 1116 00:48:13,468 --> 00:48:15,329 Here I've trained two different models 1117 00:48:15,329 --> 00:48:17,478 on two different training sets. 1118 00:48:17,478 --> 00:48:20,145 The training sets are tiny in both cases. 1119 00:48:20,145 --> 00:48:23,425 It's just MNIST three versus seven classification, 1120 00:48:23,425 --> 00:48:26,696 and this is really just for the purpose of making a slide.
1121 00:48:26,696 --> 00:48:29,207 If you train a logistic regression model 1122 00:48:29,207 --> 00:48:32,644 on the digits shown in the left panel, 1123 00:48:32,644 --> 00:48:35,903 you get the weights shown on the left in the lower panel. 1124 00:48:35,903 --> 00:48:37,585 If you train a logistic regression model 1125 00:48:37,585 --> 00:48:39,729 on the digits shown in the upper right, 1126 00:48:39,729 --> 00:48:42,564 you get the weights shown on the right in the lower panel. 1127 00:48:42,564 --> 00:48:44,225 So you've got two different training sets 1128 00:48:44,225 --> 00:48:45,619 and we learn weight vectors that look 1129 00:48:45,619 --> 00:48:47,143 very similar to each other. 1130 00:48:47,143 --> 00:48:50,080 That's just because machine learning algorithms generalize. 1131 00:48:50,080 --> 00:48:51,884 You want them to learn a function that's 1132 00:48:51,884 --> 00:48:54,740 somewhat independent of the data that you train them on. 1133 00:48:54,740 --> 00:48:55,879 It shouldn't matter which particular 1134 00:48:55,879 --> 00:48:57,884 training examples you choose. 1135 00:48:57,884 --> 00:48:58,924 If you want to generalize 1136 00:48:58,924 --> 00:49:00,545 from the training set to the test set, 1137 00:49:00,545 --> 00:49:02,781 you've also got to expect that different training sets 1138 00:49:02,781 --> 00:49:05,002 will give you more or less the same result. 1139 00:49:05,002 --> 00:49:06,583 And that means that because they've learned 1140 00:49:06,583 --> 00:49:08,340 more or less similar functions, 1141 00:49:08,340 --> 00:49:13,237 they're vulnerable to similar adversarial examples. 1142 00:49:13,237 --> 00:49:15,723 An adversary can compute an image that fools one 1143 00:49:15,723 --> 00:49:18,461 and use it to fool the other. 1144 00:49:18,461 --> 00:49:20,738 In fact we can actually go ahead and measure 1145 00:49:20,738 --> 00:49:22,386 the transfer rate between 1146 00:49:22,386 --> 00:49:24,684 several different machine learning techniques, 1147 00:49:24,684 --> 00:49:27,154 not just different data sets. 1148 00:49:27,154 --> 00:49:28,881 Nicolas Papernot and his collaborators 1149 00:49:28,881 --> 00:49:30,799 have spent a lot of time exploring 1150 00:49:30,799 --> 00:49:32,718 this transferability effect. 1151 00:49:32,718 --> 00:49:35,965 And they found that for example, 1152 00:49:35,965 --> 00:49:38,200 logistic regression makes adversarial examples 1153 00:49:38,200 --> 00:49:42,367 that transfer to decision trees with 87.4% probability. 1154 00:49:43,999 --> 00:49:48,058 Wherever you see dark squares in this matrix, 1155 00:49:48,058 --> 00:49:50,823 that shows that there's a high amount of transfer. 1156 00:49:50,823 --> 00:49:53,225 That means that it's very possible for an attacker 1157 00:49:53,225 --> 00:49:55,475 using the model on the left 1158 00:49:56,380 --> 00:50:00,547 to create adversarial examples for the model on the right. 1159 00:50:01,578 --> 00:50:03,324 The procedure overall is that, 1160 00:50:03,324 --> 00:50:05,100 suppose the attacker wants to fool a model 1161 00:50:05,100 --> 00:50:07,863 that they don't actually have access to. 1162 00:50:07,863 --> 00:50:10,364 They don't know the architecture that's used 1163 00:50:10,364 --> 00:50:11,783 to train the model. 1164 00:50:11,783 --> 00:50:13,770 They may not even know which algorithm is being used. 1165 00:50:13,770 --> 00:50:15,198 They may not know whether they're attacking 1166 00:50:15,198 --> 00:50:17,260 a decision tree or a deep neural net. 
1167 00:50:17,260 --> 00:50:20,540 And they also don't know the parameters 1168 00:50:20,540 --> 00:50:23,303 of the model that they're going to attack. 1169 00:50:23,303 --> 00:50:26,089 So what they can do is train their own model 1170 00:50:26,089 --> 00:50:29,172 that they'll use to build the attack. 1171 00:50:30,272 --> 00:50:32,175 There's two different ways you can train your own model. 1172 00:50:32,175 --> 00:50:33,703 One is you can label your own training set 1173 00:50:33,703 --> 00:50:36,620 for the same task that you want to attack. 1174 00:50:36,620 --> 00:50:39,802 Say that somebody is using an ImageNet classifier, 1175 00:50:39,802 --> 00:50:42,924 and for whatever reason you don't have access to ImageNet, 1176 00:50:42,924 --> 00:50:44,797 you can take your own photos and label them, 1177 00:50:44,797 --> 00:50:46,939 train your own object recognizer. 1178 00:50:46,939 --> 00:50:48,620 It's going to share adversarial examples 1179 00:50:48,620 --> 00:50:50,700 with an ImageNet model. 1180 00:50:50,700 --> 00:50:52,384 The other thing you can do is, 1181 00:50:52,384 --> 00:50:55,361 say that you can't afford to gather your own training set. 1182 00:50:55,361 --> 00:50:57,420 What you can do instead is if you can get 1183 00:50:57,420 --> 00:50:59,041 limited access to the model 1184 00:50:59,041 --> 00:51:02,236 where you just have the ability to send inputs to the model 1185 00:51:02,236 --> 00:51:03,804 and observe its outputs, 1186 00:51:03,804 --> 00:51:06,700 then you can send those inputs, observe the outputs, 1187 00:51:06,700 --> 00:51:09,361 and use those as your training set. 1188 00:51:09,361 --> 00:51:11,201 This'll work even if the output 1189 00:51:11,201 --> 00:51:12,740 that you get from the target model 1190 00:51:12,740 --> 00:51:15,943 is only the class label that it chooses. 1191 00:51:15,943 --> 00:51:17,882 A lot of people read this and assume that 1192 00:51:17,882 --> 00:51:19,004 you need to have access 1193 00:51:19,004 --> 00:51:21,244 to all the probability values it outputs. 1194 00:51:21,244 --> 00:51:24,975 But even just the class labels are sufficient. 1195 00:51:24,975 --> 00:51:26,684 So once you've used one of these two methods, 1196 00:51:26,684 --> 00:51:28,204 either gathering your own training set 1197 00:51:28,204 --> 00:51:31,324 or observing the outputs of a target model, 1198 00:51:31,324 --> 00:51:32,877 you can train your own model 1199 00:51:32,877 --> 00:51:36,444 and then make adversarial examples for your model. 1200 00:51:36,444 --> 00:51:38,823 Those adversarial examples are very likely to transfer 1201 00:51:38,823 --> 00:51:41,178 and affect the target model. 1202 00:51:41,178 --> 00:51:43,736 So you can then go and send those out and fool it, 1203 00:51:43,736 --> 00:51:47,569 even if you didn't have access to it directly. 1204 00:51:48,513 --> 00:51:50,503 We've also measured the transferability 1205 00:51:50,503 --> 00:51:52,360 across different data sets, 1206 00:51:52,360 --> 00:51:54,583 and for most models we find that they're 1207 00:51:54,583 --> 00:51:56,204 kind of in an intermediate zone 1208 00:51:56,204 --> 00:51:58,103 where different data sets will result 1209 00:51:58,103 --> 00:52:01,476 in a transfer rate of, like, 60% to 80%. 1210 00:52:01,476 --> 00:52:04,001 There's a few models like SVMs that are very data dependent 1211 00:52:04,001 --> 00:52:08,103 because SVMs end up focusing on a very small subset 1212 00:52:08,103 --> 00:52:10,941 of the training data to form their final decision boundary.
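A minimal sketch of the query-then-substitute procedure described a moment ago; the label-only "target" here is a hypothetical stand-in, since the whole point is that the attacker never sees its internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

# Hidden target model: the attacker can only query it for class labels.
w_target = rng.normal(size=d)

def target_label(x):
    return (x @ w_target > 0).astype(float)   # label-only oracle

# Step 1: send inputs to the target and record the labels it returns.
X = rng.normal(size=(2000, d))
y = target_label(X)

# Step 2: train a substitute model (here logistic regression) on those labels.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w = np.zeros(d)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / len(y)

# Step 3: craft FGSM examples against the substitute only; for logistic
# regression the input-gradient of the cost is proportional to (p - y) * w.
p = sigmoid(X @ w)
X_adv = X + 0.5 * np.sign((p - y)[:, None] * w[None, :])

# Transfer: the target now disagrees with its own original labels.
print("fraction of labels flipped on the target:",
      (target_label(X_adv) != y).mean())
```

In this sketch the substitute and the target happen to be closely related model families, which makes transfer especially easy; across more distant families the rates vary, as the numbers above suggest.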
1213 00:52:10,941 --> 00:52:12,744 But most models that we care about 1214 00:52:12,744 --> 00:52:15,994 are somewhere in the intermediate zone. 1215 00:52:17,444 --> 00:52:19,554 Now that's just assuming that you rely 1216 00:52:19,554 --> 00:52:22,596 on the transfer happening naturally. 1217 00:52:22,596 --> 00:52:23,879 You make an adversarial example 1218 00:52:23,879 --> 00:52:26,740 and you hope that it will transfer to your target. 1219 00:52:26,740 --> 00:52:30,353 What if you do something to stack the deck in your favor 1220 00:52:30,353 --> 00:52:33,211 and improve the odds that you'll get 1221 00:52:33,211 --> 00:52:35,860 your adversarial examples to transfer? 1222 00:52:35,860 --> 00:52:38,937 Dawn Song's group at UC Berkeley studied this. 1223 00:52:38,937 --> 00:52:43,060 They found that if they take an ensemble of different models 1224 00:52:43,060 --> 00:52:46,078 and they use gradient descent to search for 1225 00:52:46,078 --> 00:52:47,998 an adversarial example that will fool 1226 00:52:47,998 --> 00:52:50,297 every member of their ensemble, 1227 00:52:50,297 --> 00:52:53,337 then it's extremely likely that it will transfer 1228 00:52:53,337 --> 00:52:56,958 and fool a new machine learning model. 1229 00:52:56,958 --> 00:52:59,131 So if you have an ensemble of five models, 1230 00:52:59,131 --> 00:53:00,315 you can get it to the point where 1231 00:53:00,315 --> 00:53:02,596 there's essentially a 100% chance 1232 00:53:02,596 --> 00:53:04,654 that you'll fool a sixth model 1233 00:53:04,654 --> 00:53:07,249 out of the set of models that they compared. 1234 00:53:07,249 --> 00:53:09,881 They looked at things like ResNets of different depths, 1235 00:53:09,881 --> 00:53:11,464 VGG, and GoogLeNet. 1236 00:53:12,752 --> 00:53:16,055 So in the labels for each of the different rows 1237 00:53:16,055 --> 00:53:18,201 you can see that they made ensembles that lacked 1238 00:53:18,201 --> 00:53:19,835 each of these different models, 1239 00:53:19,835 --> 00:53:23,321 and then they would test it on the different target models. 1240 00:53:23,321 --> 00:53:28,137 So like if you make an ensemble that omits GoogLeNet, 1241 00:53:28,137 --> 00:53:32,076 you have only about a 5% chance of GoogLeNet 1242 00:53:32,076 --> 00:53:34,521 correctly classifying the adversarial example 1243 00:53:34,521 --> 00:53:37,023 you make for that ensemble. 1244 00:53:37,023 --> 00:53:40,507 If you make an ensemble that omits ResNet-152, 1245 00:53:40,507 --> 00:53:42,353 in their experiments they found that 1246 00:53:42,353 --> 00:53:46,520 there was a 0% chance of ResNet-152 resisting that attack. 1247 00:53:48,531 --> 00:53:50,337 That probably indicates they should have run 1248 00:53:50,337 --> 00:53:52,004 some more adversarial examples 1249 00:53:52,004 --> 00:53:54,697 until they found a non-zero success rate, 1250 00:53:54,697 --> 00:53:57,969 but it does show that the attack is very powerful. 1251 00:53:57,969 --> 00:53:59,770 And then when you actually go and 1252 00:53:59,770 --> 00:54:01,713 intentionally cause the transfer effect, 1253 00:54:01,713 --> 00:54:04,713 you can really make it quite strong. 1254 00:54:05,872 --> 00:54:08,241 A lot of people often ask me if the human brain 1255 00:54:08,241 --> 00:54:10,808 is vulnerable to adversarial examples.
1256 00:54:10,808 --> 00:54:14,436 And for this lecture I can't use copyrighted material, 1257 00:54:14,436 --> 00:54:17,360 but there are some really hilarious things on the Internet. 1258 00:54:17,360 --> 00:54:19,693 If you go looking for, like, 1259 00:54:21,329 --> 00:54:23,833 the fake CAPTCHA with images of Mark Hamill, 1260 00:54:23,833 --> 00:54:27,214 you'll find something that my perception system 1261 00:54:27,214 --> 00:54:29,015 definitely can't handle. 1262 00:54:29,015 --> 00:54:31,708 So here's another one that's actually published 1263 00:54:31,708 --> 00:54:35,577 with a license where I was confident I'm allowed to use it. 1264 00:54:35,577 --> 00:54:38,473 You can look at this image of different circles here, 1265 00:54:38,473 --> 00:54:42,217 and they appear to be intertwined spirals. 1266 00:54:42,217 --> 00:54:45,210 But in fact they are concentric circles. 1267 00:54:45,210 --> 00:54:47,521 The orientation of the edges of the squares 1268 00:54:47,521 --> 00:54:51,177 is interfering with the edge detectors in your brain, 1269 00:54:51,177 --> 00:54:55,468 making it look like the circles are spiraling. 1270 00:54:55,468 --> 00:54:57,372 So you can think of these optical illusions 1271 00:54:57,372 --> 00:54:59,847 as being adversarial examples in the human brain. 1272 00:54:59,847 --> 00:55:01,908 What's interesting is that we don't seem to share 1273 00:55:01,908 --> 00:55:03,589 many adversarial examples in common 1274 00:55:03,589 --> 00:55:05,732 with machine learning models. 1275 00:55:05,732 --> 00:55:08,174 Adversarial examples transfer extremely reliably 1276 00:55:08,174 --> 00:55:09,970 between different machine learning models, 1277 00:55:09,970 --> 00:55:11,956 especially if you use that ensemble trick 1278 00:55:11,956 --> 00:55:15,492 that was developed at UC Berkeley. 1279 00:55:15,492 --> 00:55:18,654 But those adversarial examples don't fool us. 1280 00:55:18,654 --> 00:55:20,212 It tells us that we must be using 1281 00:55:20,212 --> 00:55:22,436 a very different algorithm or model family 1282 00:55:22,436 --> 00:55:25,417 than current convolutional networks. 1283 00:55:25,417 --> 00:55:27,273 We don't really know what the difference is yet, 1284 00:55:27,273 --> 00:55:30,023 but it would be very interesting to figure that out. 1285 00:55:30,023 --> 00:55:32,953 It seems to suggest that studying adversarial examples 1286 00:55:32,953 --> 00:55:35,353 could tell us how to significantly improve 1287 00:55:35,353 --> 00:55:37,854 our existing machine learning models. 1288 00:55:37,854 --> 00:55:40,413 Even if you don't care about having an adversary, 1289 00:55:40,413 --> 00:55:43,113 we might figure out something or other about 1290 00:55:43,113 --> 00:55:45,111 how to make machine learning algorithms 1291 00:55:45,111 --> 00:55:48,116 deal with ambiguity and unexpected inputs 1292 00:55:48,116 --> 00:55:50,033 more like a human does. 1293 00:55:52,106 --> 00:55:55,594 If we actually want to go out and do attacks in practice, 1294 00:55:55,594 --> 00:56:00,276 there has started to be a body of research on this subject. 1295 00:56:00,276 --> 00:56:03,060 Nicolas Papernot showed that he could use 1296 00:56:03,060 --> 00:56:05,897 the transfer effect to fool classifiers 1297 00:56:05,897 --> 00:56:09,177 hosted by MetaMind, Amazon, and Google.
1298 00:56:09,177 --> 00:56:11,452 So these are all just different machine learning APIs 1299 00:56:11,452 --> 00:56:13,755 where you can upload a dataset 1300 00:56:13,755 --> 00:56:16,275 and the API will train the model for you. 1301 00:56:16,275 --> 00:56:19,038 And then you don't actually know, in most cases, 1302 00:56:19,038 --> 00:56:21,316 which model is trained for you. 1303 00:56:21,316 --> 00:56:23,714 You don't have access to its weights or anything like that. 1304 00:56:23,714 --> 00:56:26,168 So Nicolas would train his own copy of the model 1305 00:56:26,168 --> 00:56:27,553 by querying the API, 1306 00:56:27,553 --> 00:56:31,256 building a model on his own personal desktop 1307 00:56:31,256 --> 00:56:34,169 that he could use to fool the API-hosted model. 1308 00:56:34,169 --> 00:56:36,917 Later, Berkeley showed you could fool Clarifai in this way. 1309 00:56:36,917 --> 00:56:37,750 Yeah? 1310 00:56:37,750 --> 00:56:39,273 - [Man] What did you mean when you said 1311 00:56:39,273 --> 00:56:41,222 machine-generated adversarial examples don't generally fool us? 1312 00:56:41,222 --> 00:56:43,054 Because I thought that was part of the point, 1313 00:56:43,054 --> 00:56:46,724 that we generally make machine-generated adversarial examples 1314 00:56:46,724 --> 00:56:48,990 where just a few pixels change. 1315 00:56:48,990 --> 00:56:51,990 - Oh, so if we look at, for example, 1316 00:56:53,623 --> 00:56:55,070 like this picture of the panda. 1317 00:56:55,070 --> 00:56:56,497 To us it looks like a panda. 1318 00:56:56,497 --> 00:56:59,837 To most machine learning models it looks like a gibbon. 1319 00:56:59,837 --> 00:57:02,830 And so this change isn't interfering with our brains, 1320 00:57:02,830 --> 00:57:04,963 but it reliably fools lots of different 1321 00:57:04,963 --> 00:57:06,963 machine learning models. 1322 00:57:08,713 --> 00:57:12,836 I saw somebody actually took this image of the perturbation 1323 00:57:12,836 --> 00:57:15,433 out of our paper, and they pasted it 1324 00:57:15,433 --> 00:57:17,396 on their Facebook profile picture 1325 00:57:17,396 --> 00:57:20,551 to see if it could interfere with Facebook recognizing them. 1326 00:57:20,551 --> 00:57:22,713 And they said that it did. 1327 00:57:22,713 --> 00:57:25,956 I don't think that Facebook has a gibbon tag though, 1328 00:57:25,956 --> 00:57:29,644 so we don't know if they managed to 1329 00:57:29,644 --> 00:57:32,811 make it think that they were a gibbon. 1330 00:57:34,138 --> 00:57:35,977 And one of the other things that you can do 1331 00:57:35,977 --> 00:57:39,161 that's of fairly high practical significance 1332 00:57:39,161 --> 00:57:42,238 is you can actually fool malware detectors. 1333 00:57:42,238 --> 00:57:44,201 Kathrin Grosse at Saarland University 1334 00:57:44,201 --> 00:57:45,657 wrote a paper about this. 1335 00:57:45,657 --> 00:57:47,276 And there's starting to be a few others. 1336 00:57:47,276 --> 00:57:50,201 There's a model called MalGAN that actually uses a GAN 1337 00:57:50,201 --> 00:57:54,815 to generate adversarial examples for malware detectors. 1338 00:57:54,815 --> 00:57:57,300 Another thing that matters a lot if you are interested 1339 00:57:57,300 --> 00:57:58,840 in using these attacks in the real world 1340 00:57:58,840 --> 00:58:00,724 and defending against them in the real world 1341 00:58:00,724 --> 00:58:02,956 is that a lot of the time you don't actually 1342 00:58:02,956 --> 00:58:06,057 have access to the digital input to a model.
1343 00:58:06,057 --> 00:58:09,017 If you're interested in the perception system 1344 00:58:09,017 --> 00:58:11,300 for a self-driving car or a robot, 1345 00:58:11,300 --> 00:58:14,116 you probably don't get to actually write to the buffer 1346 00:58:14,116 --> 00:58:15,737 on the robot itself. 1347 00:58:15,737 --> 00:58:18,420 You just get to show the robot objects 1348 00:58:18,420 --> 00:58:20,500 that it can see through a camera lens. 1349 00:58:20,500 --> 00:58:24,445 So my colleague Alexey Kurakin and Samy Bengio and I 1350 00:58:24,445 --> 00:58:27,806 wrote a paper where we studied whether we could actually fool 1351 00:58:27,806 --> 00:58:30,313 an object recognition system running on a phone, 1352 00:58:30,313 --> 00:58:33,205 where it perceives the world through a camera. 1353 00:58:33,205 --> 00:58:35,345 Our methodology was really straightforward. 1354 00:58:35,345 --> 00:58:36,894 We just printed out several pictures 1355 00:58:36,894 --> 00:58:38,654 of adversarial examples. 1356 00:58:38,654 --> 00:58:41,988 And we found that the object recognition system 1357 00:58:41,988 --> 00:58:44,430 run by the camera was fooled by them. 1358 00:58:44,430 --> 00:58:46,489 The system on the camera is actually different 1359 00:58:46,489 --> 00:58:47,886 from the model that we used 1360 00:58:47,886 --> 00:58:49,550 to generate the adversarial examples. 1361 00:58:49,550 --> 00:58:53,379 So we're showing not just transfer across 1362 00:58:53,379 --> 00:58:55,826 the changes that happen when you use the camera; 1363 00:58:55,826 --> 00:58:58,009 we're also showing that those examples transfer across 1364 00:58:58,009 --> 00:59:00,022 the model that you use. 1365 00:59:00,022 --> 00:59:02,692 So the attacker could conceivably fool 1366 00:59:02,692 --> 00:59:05,267 a system that's deployed in a physical agent, 1367 00:59:05,267 --> 00:59:07,950 even if they don't have access to the model on that agent 1368 00:59:07,950 --> 00:59:11,539 and even if they can't interface directly with the agent 1369 00:59:11,539 --> 00:59:13,372 but just subtly modify 1370 00:59:15,566 --> 00:59:19,085 objects that it can see in its environment. 1371 00:59:19,085 --> 00:59:20,183 Yeah? 1372 00:59:20,183 --> 00:59:22,434 - [Man] Why does the 1373 00:59:22,434 --> 00:59:24,408 low-quality camera's image noise 1374 00:59:24,408 --> 00:59:26,586 not affect the adversarial example? 1375 00:59:26,586 --> 00:59:28,311 Because that's what one would expect. 1376 00:59:28,311 --> 00:59:30,023 - Yeah, so I think a lot of that 1377 00:59:30,023 --> 00:59:34,071 comes back to the maps that I showed earlier. 1378 00:59:34,071 --> 00:59:36,614 If you cross over the boundary into the realm 1379 00:59:36,614 --> 00:59:38,426 of adversarial examples, 1380 00:59:38,426 --> 00:59:40,846 they occupy a pretty wide space 1381 00:59:40,846 --> 00:59:43,348 and they're very densely packed in there. 1382 00:59:43,348 --> 00:59:45,108 So if you jostle around a little bit, 1383 00:59:45,108 --> 00:59:48,590 you're not going to recover from the adversarial attack. 1384 00:59:48,590 --> 00:59:50,628 If the camera noise, somehow or other, 1385 00:59:50,628 --> 00:59:53,966 was aligned with the negative gradient of the cost, 1386 00:59:53,966 --> 00:59:57,383 then the camera could take a gradient descent step downhill 1387 00:59:57,383 --> 01:00:01,407 and rescue you from the uphill step that the adversary took.
1388 01:00:01,407 --> 01:00:03,252 But probably the camera's adding more or less 1389 01:00:03,252 --> 01:00:06,699 something that you could model as a random direction. 1390 01:00:06,699 --> 01:00:09,324 Like clearly when you use the camera more than once 1391 01:00:09,324 --> 01:00:11,902 it's going to do the same thing each time, 1392 01:00:11,902 --> 01:00:15,129 but from the point of view of how that direction 1393 01:00:15,129 --> 01:00:18,868 relates to the image classification problem, 1394 01:00:18,868 --> 01:00:22,281 it's more or less a random variable that you sample once. 1395 01:00:22,281 --> 01:00:25,025 And it seems unlikely to align exactly 1396 01:00:25,025 --> 01:00:28,275 with the normal to this class boundary. 1397 01:00:33,238 --> 01:00:36,762 There's a lot of different defenses that we'd like to build. 1398 01:00:36,762 --> 01:00:39,425 And it's a little bit disappointing 1399 01:00:39,425 --> 01:00:41,265 that I'm mostly here to tell you about attacks. 1400 01:00:41,265 --> 01:00:44,088 I'd like to tell you how to make your systems more robust. 1401 01:00:44,088 --> 01:00:47,332 But basically every defense we've tried 1402 01:00:47,332 --> 01:00:49,192 has failed pretty badly. 1403 01:00:49,192 --> 01:00:52,329 And in fact, that's true even when people have published 1404 01:00:52,329 --> 01:00:54,996 that they successfully defended. 1405 01:00:55,927 --> 01:00:57,833 There have been several papers on arXiv 1406 01:00:57,833 --> 01:00:59,892 over the last several months. 1407 01:00:59,892 --> 01:01:02,873 Nicholas Carlini at Berkeley just released a paper 1408 01:01:02,873 --> 01:01:07,710 where he shows that 10 of those defenses are broken. 1409 01:01:07,710 --> 01:01:09,870 So this is a really, really hard problem. 1410 01:01:09,870 --> 01:01:11,849 You can't just make it go away by using 1411 01:01:11,849 --> 01:01:15,630 traditional regularization techniques. 1412 01:01:15,630 --> 01:01:18,328 In particular, generative models are not enough 1413 01:01:18,328 --> 01:01:19,649 to solve the problem. 1414 01:01:19,649 --> 01:01:21,366 A lot of people say, "Oh the problem that's going on here 1415 01:01:21,366 --> 01:01:22,998 "is you don't know anything about the distribution 1416 01:01:22,998 --> 01:01:25,343 "over the input pixels. 1417 01:01:25,343 --> 01:01:26,577 "If you could just tell 1418 01:01:26,577 --> 01:01:28,164 "whether the input is realistic or not 1419 01:01:28,164 --> 01:01:31,141 "then you'd be able to resist it." 1420 01:01:31,141 --> 01:01:33,469 It turns out that what matters here, 1421 01:01:33,469 --> 01:01:36,284 more than getting the right distribution 1422 01:01:36,284 --> 01:01:37,566 over the inputs x, 1423 01:01:37,566 --> 01:01:39,305 is getting the right posterior distribution 1424 01:01:39,305 --> 01:01:42,366 over the class labels y given the inputs x. 1425 01:01:42,366 --> 01:01:44,665 So just using a generative model 1426 01:01:44,665 --> 01:01:46,905 is not enough to solve the problem. 1427 01:01:46,905 --> 01:01:49,095 I think a very carefully designed generative model 1428 01:01:49,095 --> 01:01:51,070 could possibly do it. 1429 01:01:51,070 --> 01:01:54,729 Here I show two different modes of a bimodal distribution, 1430 01:01:54,729 --> 01:01:56,446 and we have two different generative models 1431 01:01:56,446 --> 01:01:58,948 that try to capture these modes. 1432 01:01:58,948 --> 01:02:01,348 On the left we have a mixture of two Gaussians.
1433 01:02:01,348 --> 01:02:04,148 On the right we have a mixture of two Laplacians. 1434 01:02:04,148 --> 01:02:06,395 You cannot really tell the difference visually 1435 01:02:06,395 --> 01:02:09,506 between the distributions they impose over x, 1436 01:02:09,506 --> 01:02:11,601 and the difference in the likelihood they assign 1437 01:02:11,601 --> 01:02:13,929 to the training data is negligible. 1438 01:02:13,929 --> 01:02:16,158 But the posterior distribution they assign over classes 1439 01:02:16,158 --> 01:02:17,886 is extremely different. 1440 01:02:17,886 --> 01:02:20,488 On the left we get a logistic regression classifier 1441 01:02:20,488 --> 01:02:22,833 that has very high confidence 1442 01:02:22,833 --> 01:02:25,143 out in the tails of the distribution 1443 01:02:25,143 --> 01:02:27,049 where there is never any training data. 1444 01:02:27,049 --> 01:02:29,108 On the right, with the Laplacian distribution, 1445 01:02:29,108 --> 01:02:32,025 we level off to more or less 50-50. 1446 01:02:33,156 --> 01:02:33,989 Yeah? 1447 01:02:33,989 --> 01:02:37,156 [speaker drowned out] 1448 01:02:44,052 --> 01:02:46,666 The issue is that it's a nonstationary distribution. 1449 01:02:46,666 --> 01:02:48,052 So if you train it to recognize 1450 01:02:48,052 --> 01:02:49,834 one kind of adversarial example, 1451 01:02:49,834 --> 01:02:52,170 then it will become vulnerable to another kind 1452 01:02:52,170 --> 01:02:55,871 that's designed to fool its detector. 1453 01:02:55,871 --> 01:02:59,631 That's one of the categories of defenses that Nicholas broke 1454 01:02:59,631 --> 01:03:02,631 in the latest paper that he put out. 1455 01:03:04,667 --> 01:03:07,231 So here, basically, the exact choice of 1456 01:03:07,231 --> 01:03:09,370 the family of generative model has a big effect 1457 01:03:09,370 --> 01:03:13,537 on whether the posterior becomes deterministic or uniform, 1458 01:03:14,765 --> 01:03:17,348 as the model extrapolates. 1459 01:03:17,348 --> 01:03:21,212 And if we could design a really rich, deep generative model 1460 01:03:21,212 --> 01:03:24,387 that can generate realistic ImageNet images 1461 01:03:24,387 --> 01:03:28,012 and also correctly calculate its posterior distribution, 1462 01:03:28,012 --> 01:03:31,389 then maybe something like this approach could work. 1463 01:03:31,389 --> 01:03:33,072 But at the moment it's really difficult to get 1464 01:03:33,072 --> 01:03:36,029 any of those probabilistic calculations correct. 1465 01:03:36,029 --> 01:03:38,273 And what usually happens is, 1466 01:03:38,273 --> 01:03:40,012 somewhere or other we make an approximation 1467 01:03:40,012 --> 01:03:42,156 that causes the posterior distribution 1468 01:03:42,156 --> 01:03:45,553 to extrapolate very linearly again. 1469 01:03:45,553 --> 01:03:48,476 It's been a difficult engineering challenge 1470 01:03:48,476 --> 01:03:50,135 to build generative models 1471 01:03:50,135 --> 01:03:54,302 that actually capture these distributions accurately. 1472 01:03:55,772 --> 01:03:58,681 The universal approximator theorem tells us that 1473 01:03:58,681 --> 01:04:00,273 whatever shape we would like 1474 01:04:00,273 --> 01:04:02,850 our classification function to have, 1475 01:04:02,850 --> 01:04:04,375 a neural net that's big enough 1476 01:04:04,375 --> 01:04:06,407 ought to be able to represent it.
1477 01:04:06,407 --> 01:04:08,505 It's an open question whether we can train the neural net 1478 01:04:08,505 --> 01:04:09,750 to have that function, 1479 01:04:09,750 --> 01:04:11,622 but we know that we should be able to 1480 01:04:11,622 --> 01:04:13,340 at least give it the right shape. 1481 01:04:13,340 --> 01:04:15,188 So far we've been getting neural nets 1482 01:04:15,188 --> 01:04:18,369 that give us these very linear decision functions, 1483 01:04:18,369 --> 01:04:19,569 and we'd like to get something 1484 01:04:19,569 --> 01:04:21,743 that looks a little bit more like a step function. 1485 01:04:21,743 --> 01:04:25,111 So what if we actually just train on adversarial examples? 1486 01:04:25,111 --> 01:04:27,545 For every input x in the training set, 1487 01:04:27,545 --> 01:04:31,727 we also say we want x plus an attack to map 1488 01:04:31,727 --> 01:04:34,252 to the same class label as the original. 1489 01:04:34,252 --> 01:04:37,187 It turns out that this sort of works. 1490 01:04:37,187 --> 01:04:39,111 You can generally resist 1491 01:04:39,111 --> 01:04:41,388 the same kind of attack that you train on. 1492 01:04:41,388 --> 01:04:43,786 And an important consideration 1493 01:04:43,786 --> 01:04:46,151 is making sure that you can run your attack very quickly 1494 01:04:46,151 --> 01:04:48,508 so that you can train on lots of examples. 1495 01:04:48,508 --> 01:04:51,089 So here the green curve at the very top, 1496 01:04:51,089 --> 01:04:53,466 the one that doesn't really descend much at all, 1497 01:04:53,466 --> 01:04:56,188 that's the test set error on adversarial examples 1498 01:04:56,188 --> 01:04:59,188 if you train on clean examples only. 1499 01:05:00,127 --> 01:05:03,889 The cyan curve that descends more or less diagonally 1500 01:05:03,889 --> 01:05:05,292 through the middle of the plot, 1501 01:05:05,292 --> 01:05:07,889 that's the test set error on adversarial examples 1502 01:05:07,889 --> 01:05:10,746 if you train on adversarial examples. 1503 01:05:10,746 --> 01:05:13,649 You can see that it does actually reduce significantly. 1504 01:05:13,649 --> 01:05:16,711 It gets down to a little bit less than 1% error. 1505 01:05:16,711 --> 01:05:20,012 And the important thing to keep in mind here is that 1506 01:05:20,012 --> 01:05:23,524 these are fast gradient sign method adversarial examples. 1507 01:05:23,524 --> 01:05:24,872 It's much harder to resist 1508 01:05:24,872 --> 01:05:27,649 iterative multi-step adversarial examples 1509 01:05:27,649 --> 01:05:29,468 where you run an optimizer for a long time 1510 01:05:29,468 --> 01:05:31,924 searching for a vulnerability. 1511 01:05:31,924 --> 01:05:33,128 And another thing to keep in mind 1512 01:05:33,128 --> 01:05:34,063 is that we're testing on 1513 01:05:34,063 --> 01:05:36,525 the same kind of adversarial examples that we train on. 1514 01:05:36,525 --> 01:05:37,772 It's harder to generalize 1515 01:05:37,772 --> 01:05:42,141 from one optimization algorithm to another. 1516 01:05:42,141 --> 01:05:44,558 By comparison, if you look at 1517 01:05:46,881 --> 01:05:48,727 what happens on clean examples, 1518 01:05:48,727 --> 01:05:50,385 the blue curve shows 1519 01:05:50,385 --> 01:05:53,089 the clean test set error rate 1520 01:05:53,089 --> 01:05:55,687 if you train only on clean examples. 1521 01:05:55,687 --> 01:05:57,249 The red curve shows what happens 1522 01:05:57,249 --> 01:06:01,260 if you train on both clean and adversarial examples.
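Mechanically, that training loop fits in a few lines. A minimal sketch, using logistic regression and synthetic data purely to keep it short (as the next part of the talk notes, a truly linear model cannot actually become robust this way):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, lr = 500, 100, 0.1, 0.5

# Synthetic binary task: two Gaussian blobs (a stand-in dataset).
X = np.vstack([rng.normal(-0.3, 1.0, (n, d)), rng.normal(0.3, 1.0, (n, d))])
y = np.repeat([0.0, 1.0], n)

w, b = np.zeros(d), 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    # Craft FGSM examples against the current parameters: for logistic
    # regression the input-gradient of the cost is proportional to (p - y) w.
    p = sigmoid(X @ w + b)
    X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])

    # Take a gradient step on clean and adversarial examples together.
    X_all = np.vstack([X, X_adv])
    y_all = np.concatenate([y, y])
    p_all = sigmoid(X_all @ w + b)
    w -= lr * X_all.T @ (p_all - y_all) / len(y_all)
    b -= lr * (p_all - y_all).mean()

# Accuracy against the same attack after adversarial training.
p = sigmoid(X @ w + b)
X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])
print("adv accuracy:", ((sigmoid(X_adv @ w + b) > 0.5) == (y > 0.5)).mean())
```

That kind of loop is what produces the red and cyan curves.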
1523 01:06:01,260 --> 01:06:02,449 We see that the red curve 1524 01:06:02,449 --> 01:06:04,967 actually drops lower than the blue curve. 1525 01:06:04,967 --> 01:06:07,445 So on this task, training on adversarial examples 1526 01:06:07,445 --> 01:06:10,188 actually helped us to do the original task better. 1527 01:06:10,188 --> 01:06:12,625 This is because on the original task we were overfitting. 1528 01:06:12,625 --> 01:06:15,544 Training on adversarial examples is a good regularizer. 1529 01:06:15,544 --> 01:06:18,202 If you're overfitting it can make you overfit less. 1530 01:06:18,202 --> 01:06:21,700 If you're underfitting it'll just make you underfit worse. 1531 01:06:21,700 --> 01:06:24,562 Other kinds of models besides deep neural nets 1532 01:06:24,562 --> 01:06:27,287 don't benefit as much from adversarial training. 1533 01:06:27,287 --> 01:06:29,525 So when we started this whole topic of study 1534 01:06:29,525 --> 01:06:30,764 we thought that deep neural nets 1535 01:06:30,764 --> 01:06:33,338 might be uniquely vulnerable to adversarial examples. 1536 01:06:33,338 --> 01:06:35,084 But it turns out that actually 1537 01:06:35,084 --> 01:06:36,625 they're one of the few models that have 1538 01:06:36,625 --> 01:06:38,916 a clear path to resisting them. 1539 01:06:38,916 --> 01:06:40,957 Linear models are just always going to be linear. 1540 01:06:40,957 --> 01:06:44,204 They don't have much hope of resisting adversarial examples. 1541 01:06:44,204 --> 01:06:46,423 Deep neural nets can be trained to be nonlinear, 1542 01:06:46,423 --> 01:06:50,955 and so it seems like there's a path to a solution for them. 1543 01:06:50,955 --> 01:06:52,261 Even with adversarial training, 1544 01:06:52,261 --> 01:06:55,418 we still find that we aren't able to 1545 01:06:55,418 --> 01:06:57,578 make models where, if you optimize the input 1546 01:06:57,578 --> 01:06:59,063 to belong to different classes, 1547 01:06:59,063 --> 01:07:01,129 you get examples in those classes. 1548 01:07:01,129 --> 01:07:04,844 Here I start with a CIFAR-10 truck and I try to turn it into 1549 01:07:04,844 --> 01:07:07,935 each of the 10 different CIFAR-10 classes. 1550 01:07:07,935 --> 01:07:09,244 Toward the middle of the plot 1551 01:07:09,244 --> 01:07:10,651 you can see that the truck has started 1552 01:07:10,651 --> 01:07:12,201 to look a little bit like a bird. 1553 01:07:12,201 --> 01:07:13,736 But the bird class is the only one 1554 01:07:13,736 --> 01:07:15,897 that we've come anywhere near hitting. 1555 01:07:15,897 --> 01:07:17,404 So even with adversarial training, 1556 01:07:17,404 --> 01:07:21,876 we're still very far from solving this problem. 1557 01:07:21,876 --> 01:07:23,180 When we do adversarial training, 1558 01:07:23,180 --> 01:07:25,500 we rely on having labels for all the examples. 1559 01:07:25,500 --> 01:07:27,340 We have an image that's labeled as a bird. 1560 01:07:27,340 --> 01:07:28,975 We make a perturbation that's designed 1561 01:07:28,975 --> 01:07:30,903 to decrease the probability of the bird class, 1562 01:07:30,903 --> 01:07:32,161 and we train the model 1563 01:07:32,161 --> 01:07:33,863 that the image should still be a bird. 1564 01:07:33,863 --> 01:07:35,483 But what if you don't have labels? 1565 01:07:35,483 --> 01:07:39,299 It turns out that you can actually train without labels. 1566 01:07:39,299 --> 01:07:42,700 You ask the model to predict the label of the original image.
1567 01:07:42,700 --> 01:07:44,298 So if you've trained for a little while 1568 01:07:44,298 --> 01:07:45,697 and your model isn't perfect yet, 1569 01:07:45,697 --> 01:07:47,804 it might say, oh, maybe this is a bird, maybe it's a plane. 1570 01:07:47,804 --> 01:07:49,324 There's some blue sky there, 1571 01:07:49,324 --> 01:07:51,550 I'm not sure which of these two classes it is. 1572 01:07:51,550 --> 01:07:53,714 Then we make an adversarial perturbation 1573 01:07:53,714 --> 01:07:55,759 that's intended to change the guess, 1574 01:07:55,759 --> 01:07:58,159 and we just try to make it say, oh, this is a truck, 1575 01:07:58,159 --> 01:07:59,357 or something like that, 1576 01:07:59,357 --> 01:08:01,236 anything that's not whatever it believed before. 1577 01:08:01,236 --> 01:08:02,983 You can then train it to say 1578 01:08:02,983 --> 01:08:04,481 that the distribution over classes 1579 01:08:04,481 --> 01:08:06,557 should still be the same as it was before, 1580 01:08:06,557 --> 01:08:08,343 so the perturbed image should still be considered 1581 01:08:08,343 --> 01:08:10,600 probably a bird or a plane. 1582 01:08:10,600 --> 01:08:12,752 This technique is called virtual adversarial training, 1583 01:08:12,752 --> 01:08:15,176 and it was invented by Takeru Miyato. 1584 01:08:15,176 --> 01:08:18,524 He was my intern at Google after he did this work. 1585 01:08:18,524 --> 01:08:22,720 At Google we invited him to come and apply his invention 1586 01:08:22,720 --> 01:08:24,637 to text classification, 1587 01:08:25,783 --> 01:08:29,500 because this ability to learn from unlabeled examples 1588 01:08:29,500 --> 01:08:32,380 makes it possible to do semi-supervised learning, 1589 01:08:32,380 --> 01:08:35,921 where you learn from both unlabeled and labeled examples. 1590 01:08:35,921 --> 01:08:38,818 And there's quite a lot of unlabeled text in the world. 1591 01:08:38,818 --> 01:08:41,142 So we were able to bring down the error rate 1592 01:08:41,142 --> 01:08:43,761 on several different text classification tasks 1593 01:08:43,761 --> 01:08:47,804 by using this virtual adversarial training. 1594 01:08:47,804 --> 01:08:49,761 Finally, there are a lot of problems where 1595 01:08:49,761 --> 01:08:52,001 we'd like to use neural nets 1596 01:08:52,001 --> 01:08:54,122 to guide optimization procedures. 1597 01:08:54,122 --> 01:08:57,243 If we want to make a very, very fast car, 1598 01:08:57,243 --> 01:08:59,510 we could imagine a neural net that looks 1599 01:08:59,511 --> 01:09:00,996 at the blueprints for a car 1600 01:09:00,996 --> 01:09:02,743 and predicts how fast it will go. 1601 01:09:02,743 --> 01:09:04,337 If we could then optimize 1602 01:09:04,337 --> 01:09:06,379 with respect to the input of the neural net 1603 01:09:06,380 --> 01:09:07,600 and find the blueprint 1604 01:09:07,600 --> 01:09:09,303 that it predicts would go the fastest, 1605 01:09:09,303 --> 01:09:11,622 we could build an incredibly fast car. 1606 01:09:11,622 --> 01:09:13,473 Unfortunately, what we get right now 1607 01:09:13,474 --> 01:09:14,975 is not a blueprint for a fast car. 1608 01:09:14,975 --> 01:09:16,959 We get an adversarial example that the model 1609 01:09:16,959 --> 01:09:18,912 thinks is going to be very fast. 1610 01:09:18,912 --> 01:09:21,758 If we're able to solve the adversarial example problem, 1611 01:09:21,759 --> 01:09:23,063 we'll be able to solve 1612 01:09:23,063 --> 01:09:25,201 this model-based optimization problem.
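Here is a rough sketch of the virtual adversarial training loss just described, following the usual two-step recipe: find the small perturbation that most changes the model's predicted distribution, then penalize changing the prediction under that perturbation. `model` and the unlabeled batch `x` are assumptions, and the hyperparameters are placeholders rather than the values from Miyato et al.

```python
import torch
import torch.nn.functional as F

def _l2_normalize(d):
    # Normalize each example's perturbation to unit L2 norm.
    norms = d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1)))
    return d / (norms + 1e-12)

def vat_loss(model, x, epsilon=2.0, xi=1e-6):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)  # the model's current guess; no labels used
    # Probe a tiny random direction and take the gradient of the KL
    # divergence to find the direction that changes the guess the most.
    d = _l2_normalize(torch.randn_like(x)).requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p,
                  reduction="batchmean")
    (d_grad,) = torch.autograd.grad(kl, d)
    r_adv = epsilon * _l2_normalize(d_grad.detach())
    # Train the predicted distribution to stay the same under that
    # worst-case perturbation: still "probably a bird or a plane."
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p,
                    reduction="batchmean")
```

In a semi-supervised setup, a loss like this on unlabeled batches is simply added to the ordinary cross-entropy on labeled batches.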
1613 01:09:25,201 --> 01:09:27,580 I like to call model-based optimization 1614 01:09:27,580 --> 01:09:29,884 the universal engineering machine. 1615 01:09:29,884 --> 01:09:32,300 If we're able to do model-based optimization, 1616 01:09:32,300 --> 01:09:34,060 we'll be able to write down a function that describes 1617 01:09:34,060 --> 01:09:37,540 a thing that doesn't exist yet but that we wish we had. 1618 01:09:37,540 --> 01:09:39,622 And then gradient descent and neural nets 1619 01:09:39,622 --> 01:09:41,339 will figure out how to build it for us. 1620 01:09:41,340 --> 01:09:44,040 We can use that to design new genes and new molecules 1621 01:09:44,040 --> 01:09:45,420 for medicinal drugs, 1622 01:09:45,420 --> 01:09:46,753 and new circuits 1623 01:09:48,836 --> 01:09:51,857 to make GPUs run faster and things like that. 1624 01:09:51,857 --> 01:09:53,697 So I think overall, solving this problem 1625 01:09:53,697 --> 01:09:58,060 could unlock a lot of potential technological advances. 1626 01:09:58,060 --> 01:10:00,439 In conclusion, attacking machine learning models 1627 01:10:00,439 --> 01:10:01,660 is extremely easy, 1628 01:10:01,660 --> 01:10:03,886 and defending them is extremely difficult. 1629 01:10:03,886 --> 01:10:06,017 If you use adversarial training 1630 01:10:06,017 --> 01:10:07,841 you can get a little bit of a defense, 1631 01:10:07,841 --> 01:10:09,297 but there are still many caveats 1632 01:10:09,297 --> 01:10:11,079 associated with that defense. 1633 01:10:11,079 --> 01:10:13,500 Adversarial training and virtual adversarial training 1634 01:10:13,500 --> 01:10:16,240 also make it possible to regularize your model 1635 01:10:16,240 --> 01:10:18,119 and even learn from unlabeled data, 1636 01:10:18,119 --> 01:10:21,031 so you can do better on regular test examples 1637 01:10:21,031 --> 01:10:23,841 even if you're not concerned about facing an adversary. 1638 01:10:23,841 --> 01:10:26,460 And finally, if we're able to solve all of these problems, 1639 01:10:26,460 --> 01:10:29,757 we'll be able to build a black-box model-based optimization 1640 01:10:29,757 --> 01:10:32,620 system that can solve all kinds of engineering problems 1641 01:10:32,620 --> 01:10:35,597 that are holding us back in many different fields. 1642 01:10:35,597 --> 01:10:39,697 I think I have a few minutes left for questions. 1643 01:10:39,697 --> 01:10:42,697 [audience applauds] 1644 01:10:47,631 --> 01:10:50,798 [speaker drowned out] 1645 01:10:57,256 --> 01:10:58,089 Yeah. 1646 01:11:15,218 --> 01:11:16,051 Oh, so, 1647 01:11:16,973 --> 01:11:18,618 there's some determinism 1648 01:11:18,618 --> 01:11:22,493 to the choice of those 50 directions. 1649 01:11:22,493 --> 01:11:23,496 Oh right, yeah. 1650 01:11:23,496 --> 01:11:24,637 So, repeating the question: 1651 01:11:24,637 --> 01:11:26,261 I've said that the same perturbation 1652 01:11:26,261 --> 01:11:27,676 can fool many different models, 1653 01:11:27,676 --> 01:11:29,221 or the same perturbation can be applied 1654 01:11:29,221 --> 01:11:31,599 to many different clean examples. 1655 01:11:31,599 --> 01:11:33,162 I've also said that the subspace 1656 01:11:33,162 --> 01:11:37,141 of adversarial perturbations is only about 50-dimensional, 1657 01:11:37,141 --> 01:11:40,938 even if the input is 3,000-dimensional. 1658 01:11:40,938 --> 01:11:43,722 So how is it that these subspaces intersect?
1659 01:11:43,722 --> 01:11:47,402 The reason is that the choice of the subspace directions 1660 01:11:47,402 --> 01:11:49,077 is not completely random. 1661 01:11:49,077 --> 01:11:51,595 It's generally going to be something like 1662 01:11:51,595 --> 01:11:55,525 pointing from one class centroid to another class centroid. 1663 01:11:55,525 --> 01:11:59,692 And if you look at that vector and visualize it as an image, 1664 01:12:00,565 --> 01:12:03,138 it might not be meaningful to a human, 1665 01:12:03,138 --> 01:12:04,362 just because humans aren't very good 1666 01:12:04,362 --> 01:12:06,717 at imagining what class centroids look like. 1667 01:12:06,717 --> 01:12:07,946 And we're really bad at imagining 1668 01:12:07,946 --> 01:12:10,140 differences between centroids. 1669 01:12:10,140 --> 01:12:12,553 But there is more or less this systematic effect 1670 01:12:12,553 --> 01:12:14,868 that causes different models to learn 1671 01:12:14,868 --> 01:12:17,000 similar linear functions, 1672 01:12:17,000 --> 01:12:21,167 just because they're trying to solve the same task. 1673 01:12:22,282 --> 01:12:25,449 [speaker drowned out] 1674 01:12:27,386 --> 01:12:29,359 Yeah, so the question is, is it possible to identify 1675 01:12:29,359 --> 01:12:33,573 which layer contributes the most to this issue? 1676 01:12:33,573 --> 01:12:35,656 One thing is that 1677 01:12:36,697 --> 01:12:39,002 the last layer is somewhat important. 1678 01:12:39,002 --> 01:12:42,653 Because, say that you made a feature extractor 1679 01:12:42,653 --> 01:12:45,263 that's completely robust to adversarial perturbations 1680 01:12:45,263 --> 01:12:48,783 and can shrink them to be very, very small, 1681 01:12:48,783 --> 01:12:51,022 and then the last layer is still linear. 1682 01:12:51,022 --> 01:12:53,781 Then it has all the problems that are typically associated 1683 01:12:53,781 --> 01:12:55,364 with linear models. 1684 01:12:57,667 --> 01:13:00,157 And generally you can do adversarial training 1685 01:13:00,157 --> 01:13:02,157 where you perturb all the different layers, 1686 01:13:02,157 --> 01:13:04,042 all the hidden layers as well as the input. 1687 01:13:04,042 --> 01:13:06,379 In this lecture I only described perturbing the input 1688 01:13:06,379 --> 01:13:07,653 because it seems like that's where 1689 01:13:07,653 --> 01:13:09,145 most of the benefit comes from. 1690 01:13:09,145 --> 01:13:11,445 The one thing that you can't do with adversarial training 1691 01:13:11,445 --> 01:13:14,279 is perturb the very last layer before the softmax, 1692 01:13:14,279 --> 01:13:15,946 because that linear layer at the end 1693 01:13:15,946 --> 01:13:18,661 has no way of learning to resist the perturbations. 1694 01:13:18,661 --> 01:13:20,740 Doing adversarial training at that layer 1695 01:13:20,740 --> 01:13:23,410 usually just breaks the whole process. 1696 01:13:23,410 --> 01:13:27,896 But other than that, it seems very problem-dependent. 1697 01:13:27,896 --> 01:13:30,741 There's a paper by Sara Sabour and her collaborators 1698 01:13:30,741 --> 01:13:34,238 called Adversarial Manipulation of Deep Representations, 1699 01:13:34,238 --> 01:13:36,536 where they design adversarial examples 1700 01:13:36,536 --> 01:13:41,439 that are intended to fool different layers of the net.
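As a toy illustration of the centroid answer above (a sketch under assumed inputs, not code from any paper): the direction from one class mean to another is a property of the task rather than of any particular model, which is part of why different models can end up sharing adversarial directions. `X` is assumed to be an (N, D) array of flattened training inputs and `y` its integer labels.

```python
import numpy as np

def centroid_direction(X, y, class_a, class_b):
    # Unit vector pointing from the mean of class a toward the mean
    # of class b, in input space.
    mu_a = X[y == class_a].mean(axis=0)
    mu_b = X[y == class_b].mean(axis=0)
    d = mu_b - mu_a
    return d / np.linalg.norm(d)

# Viewed as an image this vector usually looks like noise to a human,
# but sliding a clean example along it tends to move many different
# models' decisions in the same way.
```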
1701 01:13:41,439 --> 01:13:43,225 They report things like 1702 01:13:43,225 --> 01:13:45,418 how large of a perturbation is needed at the input 1703 01:13:45,418 --> 01:13:47,338 to get different sizes of perturbation 1704 01:13:47,338 --> 01:13:49,061 at different hidden layers. 1705 01:13:49,061 --> 01:13:50,858 I suspect that if you trained the model 1706 01:13:50,858 --> 01:13:52,616 to resist perturbations at one layer, 1707 01:13:52,616 --> 01:13:54,315 then another layer would become more vulnerable 1708 01:13:54,315 --> 01:13:57,398 and it would be like a moving target. 1709 01:14:00,901 --> 01:14:04,068 [speaker drowned out] 1710 01:14:09,775 --> 01:14:10,778 Yes, so the question is, 1711 01:14:10,778 --> 01:14:12,197 how many adversarial examples are needed 1712 01:14:12,197 --> 01:14:15,797 to improve the misclassification rate? 1713 01:14:15,797 --> 01:14:20,200 In some of our papers 1714 01:14:20,200 --> 01:14:22,157 we include learning curves, 1715 01:14:22,157 --> 01:14:24,157 so you can actually see, 1716 01:14:25,138 --> 01:14:26,602 like in this one here. 1717 01:14:26,602 --> 01:14:29,874 Every time we do an epoch we've generated the same 1718 01:14:29,874 --> 01:14:31,503 number of adversarial examples 1719 01:14:31,503 --> 01:14:33,525 as there are training examples. 1720 01:14:33,525 --> 01:14:37,701 So every epoch here is 50,000 adversarial examples. 1721 01:14:37,701 --> 01:14:41,056 You can see that adversarial training is a very 1722 01:14:41,056 --> 01:14:43,381 data-hungry process. 1723 01:14:43,381 --> 01:14:45,861 You need to make new adversarial examples 1724 01:14:45,861 --> 01:14:47,781 every time you update the weights. 1725 01:14:47,781 --> 01:14:51,112 And they're constantly changing in reaction to 1726 01:14:51,112 --> 01:14:54,862 whatever the model has learned most recently. 1727 01:14:55,861 --> 01:14:59,028 [speaker drowned out] 1728 01:15:07,264 --> 01:15:10,514 Oh, the model-based optimization, yeah. 1729 01:15:11,837 --> 01:15:13,853 Yeah, so the question is just to 1730 01:15:13,853 --> 01:15:16,277 elaborate further on this problem. 1731 01:15:16,277 --> 01:15:20,341 So most of the time that we have a machine learning model, 1732 01:15:20,341 --> 01:15:23,701 it's something like a classifier or a regression model 1733 01:15:23,701 --> 01:15:26,741 where we give it an input from the test set 1734 01:15:26,741 --> 01:15:29,040 and it gives us an output. 1735 01:15:29,040 --> 01:15:31,474 And usually that input is randomly occurring 1736 01:15:31,474 --> 01:15:34,981 and comes from the same distribution as the training set. 1737 01:15:34,981 --> 01:15:37,178 We usually just run the model, get its prediction, 1738 01:15:37,178 --> 01:15:39,435 and then we're done with it. 1739 01:15:39,435 --> 01:15:42,019 Sometimes we have feedback loops, 1740 01:15:42,019 --> 01:15:44,297 like for recommender systems. 1741 01:15:44,297 --> 01:15:47,547 If you work at Netflix and you recommend 1742 01:15:47,547 --> 01:15:50,707 a movie to a viewer, then they're more likely 1743 01:15:50,707 --> 01:15:52,757 to watch that movie and then rate it, 1744 01:15:52,757 --> 01:15:54,661 and then there are going to be more ratings of it 1745 01:15:54,661 --> 01:15:55,658 in your training set, 1746 01:15:55,658 --> 01:15:57,440 so you'll recommend it to more people in the future.
1747 01:15:57,440 --> 01:15:58,661 So there's this feedback loop 1748 01:15:58,661 --> 01:16:00,936 from the output of your model to the input. 1749 01:16:00,936 --> 01:16:04,677 Most of the time when we build machine vision systems, 1750 01:16:04,677 --> 01:16:08,522 there's no feedback loop from their output to their input. 1751 01:16:08,522 --> 01:16:09,541 If we imagine a setting 1752 01:16:09,541 --> 01:16:11,440 where we start using an optimization algorithm 1753 01:16:11,440 --> 01:16:15,607 to find inputs that maximize some property of the output, 1754 01:16:17,298 --> 01:16:18,842 like if we have a model that looks 1755 01:16:18,842 --> 01:16:20,602 at the blueprints of a car 1756 01:16:20,602 --> 01:16:24,122 and outputs the expected speed of the car, 1757 01:16:24,122 --> 01:16:27,498 then we could use gradient ascent 1758 01:16:27,498 --> 01:16:29,578 to look for the blueprints that correspond 1759 01:16:29,578 --> 01:16:31,895 to the fastest possible car. 1760 01:16:31,895 --> 01:16:33,674 Or, for example, if we're designing a medicine, 1761 01:16:33,674 --> 01:16:36,618 we could look for the molecular structure 1762 01:16:36,618 --> 01:16:40,842 that we think is most likely to cure some form of cancer, 1763 01:16:40,842 --> 01:16:42,720 or the least likely to cause 1764 01:16:42,720 --> 01:16:45,976 some kind of liver toxicity effect. 1765 01:16:45,976 --> 01:16:49,162 The problem is that once we start using optimization 1766 01:16:49,162 --> 01:16:50,720 to look for these inputs 1767 01:16:50,720 --> 01:16:53,061 that maximize the output of the model, 1768 01:16:53,061 --> 01:16:56,761 the input is no longer an independent sample 1769 01:16:56,761 --> 01:16:58,202 from the same distribution 1770 01:16:58,202 --> 01:17:00,557 as we used at training time. 1771 01:17:00,557 --> 01:17:04,202 The model is now guiding the process 1772 01:17:04,202 --> 01:17:06,218 that generates the data. 1773 01:17:06,218 --> 01:17:10,385 So we end up finding essentially adversarial examples. 1774 01:17:11,246 --> 01:17:13,104 Instead of the model telling us 1775 01:17:13,104 --> 01:17:15,242 how we can improve the input, 1776 01:17:15,242 --> 01:17:16,901 what we usually find in practice 1777 01:17:16,901 --> 01:17:19,720 is that we've got an input that fools the model 1778 01:17:19,720 --> 01:17:23,141 into thinking that the input corresponds to something great. 1779 01:17:23,141 --> 01:17:26,282 So we'd find molecules that are very toxic 1780 01:17:26,282 --> 01:17:28,901 but that the model thinks are very non-toxic, 1781 01:17:28,901 --> 01:17:30,464 or we'd find cars that are very slow 1782 01:17:30,464 --> 01:17:33,381 but that the model thinks are very fast. 1783 01:17:35,621 --> 01:17:38,788 [speaker drowned out] 1784 01:17:54,678 --> 01:17:56,017 Yeah, so the question is, 1785 01:17:56,017 --> 01:17:58,859 here the frog class is boosted by going 1786 01:17:58,859 --> 01:18:01,936 in either the positive or negative adversarial direction, 1787 01:18:01,936 --> 01:18:06,276 and in some of the other slides, like these maps, 1788 01:18:06,276 --> 01:18:09,217 you don't get that effect where subtracting epsilon off 1789 01:18:09,217 --> 01:18:12,097 eventually boosts the adversarial class. 1790 01:18:12,097 --> 01:18:13,819 Part of what's going on is that 1791 01:18:13,819 --> 01:18:16,496 I think I'm using a larger epsilon here. 1792 01:18:16,496 --> 01:18:18,135 And so you might eventually see that effect 1793 01:18:18,135 --> 01:18:20,038 if I'd made these maps wider.
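Circling back to the model-based optimization answer above, the failure mode can be sketched in a few lines: gradient ascent on the input of a fixed predictive model maximizes the model's output, not the real-world quantity. The `speed_model` here is hypothetical, just a stand-in for the blueprint-to-speed regressor described in the answer.

```python
import torch

def ascend_input(speed_model, blueprint, steps=1000, lr=0.01):
    # Hold the model fixed and optimize its input to maximize the output.
    x = blueprint.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-speed_model(x).sum()).backward()  # ascend the predicted speed
        opt.step()
    # x now scores as "very fast" under the model, but in practice it is
    # usually an adversarial example rather than a genuinely fast design.
    return x.detach()
```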
1794 01:18:20,038 --> 01:18:21,627 I made the maps narrower because 1795 01:18:21,627 --> 01:18:25,034 it's like quadratic time to build a 2D map 1796 01:18:25,034 --> 01:18:29,639 and it's linear time to build a 1D cross section. 1797 01:18:29,639 --> 01:18:33,197 So I just couldn't afford the GPU time 1798 01:18:33,197 --> 01:18:35,278 to make the maps quite as wide. 1799 01:18:35,278 --> 01:18:37,009 I also think that this might just be 1800 01:18:37,009 --> 01:18:39,999 a weird effect that happened randomly on this one example. 1801 01:18:39,999 --> 01:18:42,742 It's not something that I remember being used to seeing 1802 01:18:42,742 --> 01:18:43,878 a lot of the time. 1803 01:18:43,878 --> 01:18:45,441 Most things that I observe 1804 01:18:45,441 --> 01:18:47,495 don't happen perfectly consistently. 1805 01:18:47,495 --> 01:18:50,582 But if they happen, like, 80% of the time, 1806 01:18:50,582 --> 01:18:52,598 then I'll put them in my slide. 1807 01:18:52,598 --> 01:18:54,823 A lot of what we're doing is trying to figure out 1808 01:18:54,823 --> 01:18:56,118 more or less what's going on, 1809 01:18:56,118 --> 01:18:58,641 and so if we find that something happens 80% of the time, 1810 01:18:58,641 --> 01:19:02,198 then I consider it to be the dominant phenomenon 1811 01:19:02,198 --> 01:19:03,934 that we're trying to explain. 1812 01:19:03,934 --> 01:19:06,102 And after we've got a better explanation for that, 1813 01:19:06,102 --> 01:19:07,739 then I might start to try to explain 1814 01:19:07,739 --> 01:19:09,276 some of the weirder things that happen, 1815 01:19:09,276 --> 01:19:13,109 like the frog happening with negative epsilon. 1816 01:19:15,415 --> 01:19:18,582 [speaker drowned out] 1817 01:19:22,436 --> 01:19:24,062 I didn't fully understand the question. 1818 01:19:24,062 --> 01:19:28,145 It's about the dimensionality of the adversarial subspace? 1819 01:19:34,484 --> 01:19:35,801 Oh, okay. 1820 01:19:35,801 --> 01:19:37,504 So the question is, how is the dimension 1821 01:19:37,504 --> 01:19:39,243 of the adversarial subspace related 1822 01:19:39,243 --> 01:19:40,827 to the dimension of the input? 1823 01:19:40,827 --> 01:19:44,078 And my answer is somewhat embarrassing, 1824 01:19:44,078 --> 01:19:47,042 which is that we've only run this method on two datasets, 1825 01:19:47,042 --> 01:19:49,926 so we actually don't have a good idea yet. 1826 01:19:49,926 --> 01:19:53,526 But I think it's something interesting to study. 1827 01:19:53,526 --> 01:19:57,104 If I remember correctly, my coauthors open-sourced our code, 1828 01:19:57,104 --> 01:19:59,323 so you could probably run it on ImageNet 1829 01:19:59,323 --> 01:20:01,406 without too much trouble. 1830 01:20:02,261 --> 01:20:04,150 My contribution to that paper was in 1831 01:20:04,150 --> 01:20:06,066 the week that I was unemployed 1832 01:20:06,066 --> 01:20:09,417 between working at OpenAI and working at Google, 1833 01:20:09,417 --> 01:20:11,030 so I had access to no GPUs 1834 01:20:11,030 --> 01:20:14,288 and I ran that experiment on my laptop on CPU, 1835 01:20:14,288 --> 01:20:18,455 so it only covers really small datasets. [chuckles] 1836 01:20:19,766 --> 01:20:22,933 [speaker drowned out] 1837 01:20:40,233 --> 01:20:44,248 Oh, so the question is, do we end up perturbing 1838 01:20:44,248 --> 01:20:47,695 clean examples into low-confidence adversarial examples? 1839 01:20:47,695 --> 01:20:50,633 Yeah, in practice we usually find that 1840 01:20:50,633 --> 01:20:53,843 we can get very high confidence on the output examples.
1841 01:20:53,843 --> 01:20:57,156 One thing in high dimensions that's a little bit unintuitive 1842 01:20:57,156 --> 01:21:00,313 is that just getting the sign right 1843 01:21:00,313 --> 01:21:03,353 on very many of the input pixels 1844 01:21:03,353 --> 01:21:06,516 is enough to get a really strong response. 1845 01:21:06,516 --> 01:21:09,845 So the angle between the perturbation and the weight vector 1846 01:21:09,845 --> 01:21:13,492 matters a lot more than the exact coordinates 1847 01:21:13,492 --> 01:21:15,825 in high-dimensional systems. 1848 01:21:18,255 --> 01:21:20,087 Does that make enough sense? 1849 01:21:20,087 --> 01:21:21,004 Yeah, okay. 1850 01:21:21,868 --> 01:21:23,673 - [Man] So we're actually going to [mumbles]. 1851 01:21:23,673 --> 01:21:26,095 So if you guys need to leave, that's fine. 1852 01:21:26,095 --> 01:21:28,175 But let's thank our speaker one more time 1853 01:21:28,175 --> 00:00:00,000 for getting-- [audience applauds]
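A closing numeric illustration of that last point (a toy under stated assumptions, with a random linear model standing in for a real network): a max-norm perturbation that merely gets the sign of each weight right produces a response that grows linearly with the dimension, while an equally large perturbation with random signs produces almost none.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 3000, 0.01                       # input dimension, max-norm budget
w = rng.normal(size=n)                    # weights of a toy linear model

adv = eps * np.sign(w)                    # only the signs match the weights
rand = eps * np.sign(rng.normal(size=n))  # same budget, random signs

print(w @ adv)   # about eps * sum(|w_i|): grows linearly with n
print(w @ rand)  # near zero on average: no systematic response
```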